The NPU-ASLP System Description for Visual Speech Recognition in CNVSRC 2024

He Wang; Lei Xie

arXiv:2408.02369 · cs.CV · September 13, 2024

TL;DR

This paper describes the NPU-ASLP visual speech recognition system for CNVSRC 2024, utilizing advanced data augmentation and an end-to-end model architecture to achieve top results in multiple VSR challenge tracks.

Contribution

Introduction of an end-to-end VSR system with novel architecture components and extensive data augmentation, achieving state-of-the-art performance in CNVSRC 2024.
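Two of the augmentations named in the abstract, speed perturbation and horizontal flipping, can be sketched on a clip stored as a list of frames. This is only an illustration, not the authors' pipeline: the function names, the nearest-index resampling, the flip probability, and the speed factors are all assumptions, and random rotation and color transformation are left out.

```python
import random


def horizontal_flip(clip):
    """Mirror every frame left-to-right (clip: list of frames, frame: list of rows)."""
    return [[row[::-1] for row in frame] for frame in clip]


def speed_perturb(clip, factor):
    """Resample frames by nearest-index selection; factor > 1 yields fewer frames."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not clip:
        return []
    n_out = max(1, round(len(clip) / factor))
    last = len(clip) - 1
    return [clip[min(last, int(i * factor))] for i in range(n_out)]


def augment(clip, rng, flip_prob=0.5, factors=(0.9, 1.0, 1.1)):
    """Apply speed perturbation, then a random flip (assumed, illustrative settings)."""
    clip = speed_perturb(clip, rng.choice(factors))
    if rng.random() < flip_prob:
        clip = horizontal_flip(clip)
    return clip


if __name__ == "__main__":
    # Ten tiny 1x2 "frames" stand in for cropped lip-region images.
    demo = [[[i, i + 1]] for i in range(10)]
    out = augment(demo, random.Random(0))
    print(len(out), out[0])
```

Passing a seeded `random.Random` keeps the augmentation reproducible, which helps when comparing training runs.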

Findings

01 Achieved 30.47% CER on the Single-Speaker VSR Task

02 Secured second place in the open track of the Single-Speaker Task

03 Took first place in the other three tracks
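The CER reported in these findings is the character error rate. As a minimal sketch, assuming the usual definition (total character-level edit distance divided by total reference characters), it could be computed as below. The function names are hypothetical, and the challenge's exact scoring rules, such as text normalization, are not stated here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(refs, hyps) -> float:
    """Corpus-level CER: total edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("reference corpus is empty")
    return total_edits / total_chars


if __name__ == "__main__":
    # One substitution and one deletion against a six-character reference.
    print(f"{cer(['今天天气很好'], ['今天天汽好']):.2%}")
```

Under this definition, a 30.47% CER corresponds to roughly 30 character edits per 100 reference characters.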

Abstract

This paper describes the visual speech recognition (VSR) system submitted by NPU-ASLP (Team 237) to the second Chinese Continuous Visual Speech Recognition Challenge (CNVSRC 2024), which participated in all four tracks: the fixed and open tracks of the Single-Speaker VSR Task and of the Multi-Speaker VSR Task. For data processing, we leverage the lip motion extractor from the baseline to produce multiscale video data. In addition, various augmentation techniques are applied during training, including speed perturbation, random rotation, horizontal flipping, and color transformation. The VSR model adopts an end-to-end architecture with joint CTC/attention loss, introducing an Enhanced ResNet3D visual frontend, an E-Branchformer encoder, and a bi-directional Transformer decoder. Our approach yields a 30.47% CER for the Single-Speaker Task and 34.30% CER for the Multi-Speaker Task, securing…
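The joint CTC/attention loss mentioned in the abstract is commonly written as a weighted sum of the two objectives. The form below is the standard hybrid formulation rather than one quoted from the paper: the weights \lambda and \alpha are assumed symbols, their values are not given here, and the split of the attention term into left-to-right and right-to-left parts is one common way of training a bi-directional decoder.

```latex
\mathcal{L} = \lambda \, \mathcal{L}_{\mathrm{CTC}} + (1 - \lambda) \, \mathcal{L}_{\mathrm{att}},
\qquad 0 \le \lambda \le 1

\mathcal{L}_{\mathrm{att}} = (1 - \alpha) \, \mathcal{L}_{\mathrm{L2R}} + \alpha \, \mathcal{L}_{\mathrm{R2L}},
\qquad 0 \le \alpha \le 1
```

Setting \lambda = 0 recovers a pure attention-based model, and \lambda = 1 a pure CTC model.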

Code & Models

Repositories

https://gitlab.com/csltstu/sunine
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Attention Is All You Need · Linear Layer · Residual Connection · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Softmax · Absolute Position Encodings · Dense Connections