The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in CNVSRC 2023

He Wang; Pengcheng Guo; Wei Chen; Pan Zhou; Lei Xie

arXiv:2401.06788·eess.AS·March 1, 2024·1 cites


TL;DR

This paper describes the visual speech recognition (VSR) system that the NPU-ASLP-LiAuto team submitted to CNVSRC 2023. It combines multi-scale video data, training-time data augmentation, and an end-to-end model trained with a joint CTC/attention loss, and it ranked first in all three tracks the team entered.

Contribution

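The paper describes a VSR system that pairs multi-scale video data from the baseline lip motion extractor with training-time augmentations such as speed perturbation, random rotation, horizontal flipping, and color transformation. The model is end-to-end, built from a ResNet3D visual frontend, an E-Branchformer encoder, and a Transformer decoder. The system ranked first in all three CNVSRC 2023 tracks the team entered.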

Findings

01 Achieved 34.76% CER in Single-Speaker VSR

02 Achieved 41.06% CER in Multi-Speaker VSR

03 Ranked first in all three participating tracks
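
The findings above are reported as character error rate (CER). As a minimal sketch of how such a figure is typically computed, the block below counts character-level edits (substitutions, insertions, deletions) and divides by the number of reference characters. The function names and the example sentences are illustrative, and pooling edits over the whole corpus is an assumption; the challenge's official scoring script may differ in details such as text normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of characters."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER: total edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars


# Hypothetical transcripts, not taken from the challenge data.
refs = ["今天天气很好", "你好"]
hyps = ["今天天汽很好", "你好吗"]
print(f"{cer(refs, hyps):.2%}")  # one substitution + one insertion over 8 characters: 25.00%
```

Under this definition, a CER of 34.76% means roughly one character-level edit for every three reference characters.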

Abstract

This paper delineates the visual speech recognition (VSR) system introduced by the NPU-ASLP-LiAuto (Team 237) in the first Chinese Continuous Visual Speech Recognition Challenge (CNVSRC) 2023, engaging in the fixed and open tracks of Single-Speaker VSR Task, and the open track of Multi-Speaker VSR Task. In terms of data processing, we leverage the lip motion extractor from the baseline to produce multi-scale video data. Besides, various augmentation techniques are applied during training, encompassing speed perturbation, random rotation, horizontal flipping, and color transformation. The VSR model adopts an end-to-end architecture with joint CTC/attention loss, comprising a ResNet3D visual frontend, an E-Branchformer encoder, and a Transformer decoder. Experiments show that our system achieves 34.76% CER for the Single-Speaker Task and 41.06% CER for the Multi-Speaker Task after…
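
The joint CTC/attention loss named in the abstract is commonly written as a weighted sum of the two objectives. The form below is the standard hybrid formulation, given here as a sketch; the symbol $\lambda$ and its range are assumptions, since the abstract does not state the weighting the authors used.

```latex
\mathcal{L}_{\mathrm{joint}}
  = \lambda \, \mathcal{L}_{\mathrm{CTC}}
  + (1 - \lambda) \, \mathcal{L}_{\mathrm{att}},
\qquad 0 \le \lambda \le 1
```

Here $\mathcal{L}_{\mathrm{CTC}}$ would be computed on the encoder output and $\mathcal{L}_{\mathrm{att}}$ on the Transformer decoder output. Setting $\lambda = 0$ reduces to a pure attention-based model, and $\lambda = 1$ to a pure CTC model.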


Taxonomy

Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Indoor and Outdoor Localization Technologies

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Absolute Position Encodings · Layer Normalization · Dropout · Softmax · Adam · Residual Connection