Cocktail-Party Audio-Visual Speech Recognition

Thai-Binh Nguyen; Ngoc-Quan Pham; Alexander Waibel

arXiv:2506.02178·cs.SD·June 4, 2025

TL;DR

This paper introduces a 1526-hour audio-visual dataset and an AVSR method that reduces WER from 119% to 39.2% in extreme-noise cocktail-party conditions, handling the mix of speaking and silent facial segments that prior models often overlook.

Contribution

The study provides a new 1526-hour AVSR dataset with both talking-face and silent-face segments and demonstrates a method that, without explicit segmentation, reduces WER by 67% relative to the state of the art in noisy environments.
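The 67% figure is a relative reduction, not a drop of 67 percentage points. As a quick check, the sketch below recomputes it from the two WER values reported in the Findings (119% and 39.2%). The function name `relative_wer_reduction` is a hypothetical helper introduced here for illustration; it is not part of the authors' code.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# WER values as reported on this page, in percent.
reduction = relative_wer_reduction(119.0, 39.2)
print(f"{reduction:.1%}")  # prints 67.1%
```

The absolute drop is 79.8 percentage points; dividing by the 119% baseline gives roughly 67%.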

Findings

01 Reduced WER from 119% to 39.2% in extreme noise conditions.

02 Introduced a 1526-hour AVSR dataset with talking and silent segments.

03 Achieved significant performance gains in cocktail-party environments.
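A WER above 100%, such as the 119% baseline in the first finding, is possible because WER counts substitutions, deletions, and insertions against the number of reference words, so a system that emits many extra words can accumulate more errors than there are reference words. The sketch below is a minimal word-level edit-distance implementation of that standard definition, not the authors' evaluation code; the function name and the toy sentences are assumptions made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Toy example (not from the dataset): four inserted words, two reference words.
wer = word_error_rate("turn left", "please turn left at the light")
print(f"{wer:.0%}")  # prints 200%
```

In this toy case every reference word is recognized correctly, yet the four inserted words alone push WER to 200%.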

Abstract

Audio-Visual Speech Recognition (AVSR) offers a robust solution for speech recognition in challenging environments, such as cocktail-party scenarios, where relying solely on audio proves insufficient. However, current AVSR models are often optimized for idealized scenarios with consistently active speakers, overlooking the complexities of real-world settings that include both speaking and silent facial segments. This study addresses this gap by introducing a novel audio-visual cocktail-party dataset designed to benchmark current AVSR systems and highlight the limitations of prior approaches in realistic noisy conditions. Additionally, we contribute a 1526-hour AVSR dataset comprising both talking-face and silent-face segments, enabling significant performance gains in cocktail-party environments. Our approach reduces WER by 67% relative to the state-of-the-art, reducing WER from 119% to…

Code & Models

Models

nguyenvulebinh/AVSRCocktail (Hugging Face model, 203 downloads)

Taxonomy

Topics: Speech and Audio Processing