Lipreading is the task of decoding text from the movement of a speaker’s mouth. We present LipWiz, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end.
Our model operates at the character level, combining spatiotemporal convolutional neural networks (STCNNs), recurrent neural networks (RNNs), and the connectionist temporal classification (CTC) loss. We also compare the performance of LipWiz with that of hearing-impaired people who can lipread on the GRID corpus task. On average, we achieve an accuracy of 92.3%.
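To make the architecture description concrete, here is a minimal sketch of such a pipeline. This is not the paper's implementation: the class name, layer sizes, kernel shapes, and vocabulary size are all illustrative assumptions, chosen only to show how a 3D-convolutional front-end, a recurrent back-end, and the CTC loss fit together.

```python
# Hypothetical sketch (PyTorch) of an STCNN + RNN + CTC lipreading pipeline.
# All dimensions and names are assumptions for illustration, not the paper's.
import torch
import torch.nn as nn

class LipreaderSketch(nn.Module):
    def __init__(self, num_chars=28):  # e.g. 26 letters + space + CTC blank
        super().__init__()
        # STCNN front-end: 3D convolution over (time, height, width)
        self.stcnn = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),  # pool spatially, keep time
        )
        # Recurrent back-end over the time axis
        self.gru = nn.GRU(input_size=32, hidden_size=64,
                          bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 64, num_chars)

    def forward(self, video):                  # video: (batch, 3, T, H, W)
        x = self.stcnn(video)                  # (batch, 32, T, H', W')
        x = x.mean(dim=(3, 4))                 # global spatial pooling
        x = x.transpose(1, 2)                  # (batch, T, 32)
        x, _ = self.gru(x)                     # (batch, T, 128)
        return self.fc(x).log_softmax(-1)      # per-frame character log-probs

model = LipreaderSketch()
frames = torch.randn(2, 3, 16, 32, 64)         # 2 clips of 16 RGB mouth crops
log_probs = model(frames)                      # (2, 16, 28)

# CTC aligns frame-level predictions with unsegmented character targets,
# which is what removes the need for word-level video segmentation.
ctc = nn.CTCLoss(blank=0)
targets = torch.randint(1, 28, (2, 8))
loss = ctc(log_probs.transpose(0, 1),          # CTCLoss expects (T, batch, C)
           targets,
           input_lengths=torch.full((2,), 16, dtype=torch.long),
           target_lengths=torch.full((2,), 8, dtype=torch.long))
```

The key design point the sketch reflects is that no component is trained separately: gradients from the CTC loss flow back through the recurrent layers into the spatiotemporal convolutions, so visual features and the sequence model are learned jointly.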
We proposed LipWiz, the first model to apply deep learning to end-to-end sentence-level lipreading, mapping sequences of image frames of a speaker's mouth directly to entire sentences. The end-to-end formulation eliminates the need to segment videos into words before predicting a sentence. LipWiz requires neither hand-engineered spatiotemporal visual features nor a separately trained sequence model.