Target Speaker Lipreading by Audio-Visual Self-Distillation Pretraining and Speaker Adaptation
Jing-Xuan Zhang, Tingzhi Mao, Longjiang Guo, Jin Li, Lichen Zhang

TL;DR
This paper introduces a comprehensive approach combining cross-lingual transfer learning, speaker adaptation, and model ensembling to improve lipreading accuracy across languages and speakers, achieving state-of-the-art results in Chinese lipreading.
Contribution
It proposes novel methods for cross-lingual transfer, speaker adaptation, and model ensembling to enhance lipreading performance in multilingual and speaker-specific contexts.
Findings
Achieved a character error rate (CER) of 77.3% on the ChatCLR dataset
Outperformed the top results in the 2024 Chat-scenario Chinese Lipreading Challenge
Demonstrated the effectiveness of combining lip ROI and face inputs
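For readers unfamiliar with the metric in the findings above, the following is a minimal sketch of how a character error rate is commonly computed: the total character-level edit distance between hypotheses and references, divided by the total number of reference characters. Lower is better. The function names `edit_distance` and `cer` and the example sentences are illustrative assumptions; the official ChatCLR scoring script may differ in text normalization and other details.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER: summed edit distance over summed reference length."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    return total_edits / total_chars


# Hypothetical example: one substituted character out of six.
score = cer(["今天天气很好"], ["今天天气不好"])
print(f"CER = {score:.1%}")
```

Because the numerator counts insertions as well as substitutions and deletions, a CER can exceed 100% when hypotheses are much longer than their references.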
Abstract
Lipreading is an important technique for facilitating human-computer interaction in noisy environments. Our previously developed self-supervised learning method, AV2vec, which leverages multimodal self-distillation, has demonstrated promising performance in speaker-independent lipreading on the English LRS3 dataset. However, AV2vec faces challenges such as high training costs and a potential scarcity of audio-visual data for lipreading in languages other than English, such as Chinese. Additionally, most studies concentrate on speaker-independent lipreading models, which struggle to account for the substantial variation in speaking styles across different speakers. To address these issues, we propose a comprehensive approach. First, we investigate cross-lingual transfer learning, adapting a pre-trained AV2vec model from a source language and optimizing it for the lipreading task in a…
Taxonomy
Topics: Speech and Audio Processing
Methods: Sparse Evolutionary Training
