When AVSR Meets Video Conferencing: Dataset, Degradation, and the Hidden Mechanism Behind Performance Collapse

Yihuan Huang; Jun Xue; Liu Jiajun; Daixian Li; Tong Zhang; Zhuolin Yi; Yanzhen Ren; Kai Li

arXiv:2603.22915 · cs.CV · March 25, 2026

TL;DR

This paper evaluates the robustness of audio-visual speech recognition (AVSR) models in video conferencing, constructs a new dataset that incorporates the Lombard effect, and shows how speech enhancement causes distribution shifts that degrade model performance.

Contribution

The paper introduces MLD-VC, the first multimodal dataset tailored for video conferencing (VC), and analyzes how speech enhancement and the Lombard effect affect AVSR robustness.

Findings

01

Speech enhancement algorithms cause distribution shifts in audio features.

02

Distribution shifts induced by the Lombard effect resemble those caused by speech enhancement, which improves AVSR robustness.

03

Fine-tuning on MLD-VC reduces the character error rate (CER) by 17.5% on average across VC platforms.
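
For readers unfamiliar with the metric, the following is a minimal sketch of how CER is conventionally computed: the character-level edit distance between reference and hypothesis, divided by the reference length. The function names are hypothetical and are not taken from the paper. This summary does not say whether the 17.5% figure is an absolute or a relative reduction, so the helper returns both.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def cer_reduction(before: float, after: float) -> tuple[float, float]:
    """Return (absolute, relative) CER reduction; both readings are shown
    because the source does not specify which one it reports."""
    return before - after, (before - after) / before
```

In practice the per-platform CER before and after fine-tuning would be passed to `cer_reduction` and the results averaged across platforms.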

Abstract

Audio-Visual Speech Recognition (AVSR) has achieved remarkable progress in offline conditions, yet its robustness in real-world video conferencing (VC) remains largely unexplored. This paper presents the first systematic evaluation of state-of-the-art AVSR models across mainstream VC platforms, revealing severe performance degradation caused by transmission distortions and spontaneous human hyper-expression. To address this gap, we construct MLD-VC, the first multimodal dataset tailored for VC, comprising 31 speakers, 22.79 hours of audio-visual data, and explicit use of the Lombard effect to enhance human hyper-expression. Through comprehensive analysis, we find that speech enhancement algorithms are the primary source of distribution shift, which alters the first and second formants of audio. Interestingly, we find that the distribution shift induced by the Lombard effect…
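
The abstract refers to shifts in the first and second formants (F1 and F2). As a rough illustration of how such values can be measured, the sketch below uses textbook linear predictive coding (LPC) with the autocorrelation method and reads formant frequencies off the angles of the LPC polynomial roots. This excerpt does not describe the authors' measurement procedure, so this is an assumed, generic approach. All function names are hypothetical, and the test signal is a synthetic all-pole impulse response, not data from MLD-VC.

```python
import numpy as np


def lpc_coefficients(signal: np.ndarray, order: int) -> np.ndarray:
    """Autocorrelation-method LPC; returns [1, a_1, ..., a_order]."""
    n = len(signal)
    r = np.array([np.dot(signal[: n - k], signal[k:]) for k in range(order + 1)])
    # Build the Toeplitz autocorrelation matrix and solve R a = -r[1:].
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    a = np.linalg.solve(r[idx], -r[1:])
    return np.concatenate(([1.0], a))


def formants_hz(signal: np.ndarray, sample_rate: float, order: int) -> list[float]:
    """Estimate resonance frequencies from the angles of the LPC roots."""
    roots = np.roots(lpc_coefficients(signal, order))
    roots = roots[np.imag(roots) > 1e-9]  # keep one root per conjugate pair
    freqs = np.angle(roots) * sample_rate / (2 * np.pi)
    return sorted(float(f) for f in freqs)


def synthetic_vowel(resonances_hz, sample_rate, radius=0.98, n_samples=4000):
    """Toy test signal: impulse response of an all-pole filter whose pole
    angles correspond to the requested resonance frequencies."""
    poles = []
    for f in resonances_hz:
        p = radius * np.exp(2j * np.pi * f / sample_rate)
        poles += [p, np.conj(p)]
    a = np.real(np.poly(poles))  # denominator coefficients [1, a_1, ...]
    y = np.zeros(n_samples)
    for i in range(n_samples):
        acc = 1.0 if i == 0 else 0.0  # unit impulse input
        for k in range(1, len(a)):
            if i - k >= 0:
                acc -= a[k] * y[i - k]
        y[i] = acc
    return y


signal = synthetic_vowel([500.0, 1500.0], sample_rate=16000)
f1, f2 = formants_hz(signal, sample_rate=16000, order=4)
```

Real speech would additionally need framing, pre-emphasis, windowing, and a higher LPC order. Comparing F1 and F2 estimates for the same utterance before and after a speech enhancement stage is one way a formant shift of the kind described above could be quantified.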

Code & Models

Datasets

nccm2p2/MLD-VC
dataset · 131 downloads

Taxonomy

Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Music and Audio Processing