Target Speaker Lipreading by Audio-Visual Self-Distillation Pretraining and Speaker Adaptation

Jing-Xuan Zhang; Tingzhi Mao; Longjiang Guo; Jin Li; Lichen Zhang

arXiv:2502.05758·eess.AS·February 11, 2025·Expert Syst. Appl.

Open Access

TL;DR

This paper introduces a comprehensive approach combining cross-lingual transfer learning, speaker adaptation, and model ensembling to improve lipreading accuracy across languages and speakers, achieving state-of-the-art results in Chinese lipreading.

Contribution

It proposes novel methods for cross-lingual transfer, speaker adaptation, and model ensembling to enhance lipreading performance in multilingual and speaker-specific contexts.

Findings

01

Achieved a character error rate (CER) of 77.3% on the ChatCLR dataset

02

Outperformed the top results in the 2024 Chat-scenario Chinese Lipreading Challenge

03

Demonstrated effectiveness of combining lip ROI and face inputs
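
The CER reported in the findings is presumably the standard character error rate: the character-level edit distance between hypothesis and reference, divided by the number of reference characters. The scoring script used for ChatCLR is not shown on this page, so the following is only a minimal sketch under that assumption; the function names `edit_distance` and `cer` and the example strings are illustrative and do not come from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # cost of deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference char missing in hypothesis)
                curr[j - 1] + 1,     # insertion (extra char in hypothesis)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER: total edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars


if __name__ == "__main__":
    # Hypothetical example: one character dropped from a six-character reference.
    print(f"{cer(['今天天气很好'], ['今天天很好']):.3f}")
```

Lower is better under this definition, and because insertions count as errors the value can exceed 100% when a hypothesis is much longer than its reference.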

Abstract

Lipreading is an important technique for facilitating human-computer interaction in noisy environments. Our previously developed self-supervised learning method, AV2vec, which leverages multimodal self-distillation, has demonstrated promising performance in speaker-independent lipreading on the English LRS3 dataset. However, AV2vec faces challenges such as high training costs and a potential scarcity of audio-visual data for lipreading in languages other than English, such as Chinese. Additionally, most studies concentrate on speaker-independent lipreading models, which struggle to account for the substantial variation in speaking styles across different speakers. To address these issues, we propose a comprehensive approach. First, we investigate cross-lingual transfer learning, adapting a pre-trained AV2vec model from a source language and optimizing it for the lipreading task in a…

Taxonomy

Topics: Speech and Audio Processing

Methods: Sparse Evolutionary Training