InfoSyncNet: Information Synchronization Temporal Convolutional Network for Visual Speech Recognition
Junxiao Xue, Xiaozhen Liu, Xuecheng Wu, Fei Yu, Jun Wang

TL;DR
This paper introduces InfoSyncNet, a temporal convolutional network that combines non-uniform sequence modeling with tailored data augmentation. It reports state-of-the-art accuracy for visual speech recognition from silent videos: 92.0% on LRW and 60.7% on LRW1000.
Contribution
The paper presents a new model with a non-uniform quantization module and tailored training strategies for improved visual speech recognition.
Findings
Achieved 92.0% accuracy on the LRW dataset.
Achieved 60.7% accuracy on the LRW1000 dataset.
Outperformed existing methods in visual speech recognition.
Abstract
Estimating spoken content from silent videos is crucial for applications in Assistive Technology (AT) and Augmented Reality (AR). However, accurately mapping lip movement sequences in videos to words poses significant challenges due to variability across sequences and the uneven distribution of information within each sequence. To tackle this, we introduce InfoSyncNet, a non-uniform sequence modeling network enhanced by tailored data augmentation techniques. Central to InfoSyncNet is a non-uniform quantization module positioned between the encoder and decoder, enabling dynamic adjustment to the network's focus and effectively handling the natural inconsistencies in visual speech data. Additionally, multiple training strategies are incorporated to enhance the model's capability to handle variations in lighting and the speaker's orientation. Comprehensive experiments on the LRW and…
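The abstract does not say how the non-uniform quantization module is built, so the following is only a generic illustration of what "non-uniform quantization" means, not the authors' module. It uses mu-law companding, a standard technique swapped in here for illustration, which assigns many narrow bins to small values and few wide bins to large ones. The function names and the parameters `mu` and `levels` are assumptions made for this sketch.

```python
import math

def mu_law_quantize(x, mu=255, levels=256):
    """Map x in [-1, 1] to an integer bin in [0, levels - 1].

    Generic mu-law companding, used here only to illustrate
    non-uniform quantization; it is NOT InfoSyncNet's module.
    """
    x = max(-1.0, min(1.0, x))  # clamp to the valid input range
    # Compress: small magnitudes are stretched, large ones are squeezed.
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Uniform quantization of the compressed value gives non-uniform bins in x.
    return int(round((y + 1) / 2 * (levels - 1)))

def mu_law_dequantize(q, mu=255, levels=256):
    """Approximate inverse of mu_law_quantize."""
    y = 2 * q / (levels - 1) - 1
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

if __name__ == "__main__":
    for value in (0.0, 0.01, 0.1, 0.5, 0.9, 1.0):
        q = mu_law_quantize(value)
        print(value, q, round(mu_law_dequantize(q), 4))
```

With these defaults, far more bins fall between 0.0 and 0.1 than between 0.9 and 1.0, which is the sense in which the resolution is non-uniform.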
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Hearing Loss and Rehabilitation
