The NPU-ASLP System Description for Visual Speech Recognition in CNVSRC 2024
He Wang, Lei Xie

TL;DR
This paper describes the NPU-ASLP visual speech recognition system for CNVSRC 2024, utilizing advanced data augmentation and an end-to-end model architecture to achieve top results in multiple VSR challenge tracks.
Contribution
Introduction of an end-to-end VSR system with novel architecture components and extensive data augmentation, achieving state-of-the-art performance in CNVSRC 2024.
Findings
Achieved 30.47% CER in Single-Speaker VSR
Secured second place in open track of Single-Speaker Task
First place in three other tracks
Abstract
This paper delineates the visual speech recognition (VSR) system introduced by the NPU-ASLP (Team 237) in the second Chinese Continuous Visual Speech Recognition Challenge (CNVSRC 2024), engaging in all four tracks, including the fixed and open tracks of Single-Speaker VSR Task and Multi-Speaker VSR Task. In terms of data processing, we leverage the lip motion extractor from the baseline1 to produce multiscale video data. Besides, various augmentation techniques are applied during training, encompassing speed perturbation, random rotation, horizontal flipping, and color transformation. The VSR model adopts an end-to-end architecture with joint CTC/attention loss, introducing Enhanced ResNet3D visual frontend, E-Branchformer encoder, and Bi-directional Transformer decoder. Our approach yields a 30.47% CER for the Single-Speaker Task and 34.30% CER for the Multi-Speaker Task, securing…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Speech and Audio Processing
MethodsAttention Is All You Need · Linear Layer · Residual Connection · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Softmax · Absolute Position Encodings · Dense Connections
