A Cocktail-Party Benchmark: Multi-Modal dataset and Comparative Evaluation Results
Thai-Binh Nguyen, Katerina Zmolikova, Pingchuan Ma, Ngoc Quan Pham, Christian Fuegen, Alexander Waibel

TL;DR
This paper introduces a multi-modal dataset and benchmark for transcribing overlapping multi-party conversations and grouping speakers into their respective conversations using audio, visual, and contextual cues, with the aim of advancing research on natural multi-party speech recognition.
Contribution
It presents the MCoRec task, a novel benchmark with a multi-modal dataset for multi-party conversation recognition in challenging overlapping scenarios, including baseline systems and evaluation results.
Findings
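Audio-only baselines exceed 100% word error rate under the dataset's extreme speech overlap.
Incorporating visual cues yields a substantial 50% improvement over those audio-only baselines.
Together, these results highlight the importance of multi-modality for this task.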
Abstract
We introduce the task of Multi-Modal Context-Aware Recognition (MCoRec) in the ninth CHiME Challenge, which addresses the cocktail-party problem of overlapping conversations in a single-room setting using audio, visual, and contextual cues. MCoRec captures natural multi-party conversations where the recordings focus on unscripted, casual group chats, leading to extreme speech overlap of up to 100% and highly fragmented conversational turns. The task requires systems to answer the question "Who speaks when, what, and with whom?" by jointly transcribing each speaker's speech and clustering them into their respective conversations from audio-visual recordings. Audio-only baselines exceed 100% word error rate, whereas incorporating visual cues yields substantial 50% improvements, highlighting the importance of multi-modality. In this manuscript, we present the motivation behind the task,…
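A word error rate above 100% can look like a mistake, so a small worked example may help. The sketch below uses the standard definition of WER (substitutions, deletions, and insertions divided by the number of reference words), computed with a word-level edit distance. It is not the challenge's official scoring script: the function name `word_error_rate` and the example sentences are made up for illustration, and whitespace tokenization is an assumption. The point is that when a system transcribes words from an overlapping neighbouring conversation, insertions alone can push the error count past the reference length.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length.

    Illustrative helper only (hypothetical name, whitespace tokenization);
    not the official MCoRec scoring code.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# The target speaker said three words, but the hypothesis also contains
# five words leaked from an overlapping conversation (made-up example).
reference = "see you tomorrow"
hypothesis = "see you tomorrow did you watch the game"
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.1%}")
```

Here the five inserted words are counted against a three-word reference, so the WER is 5/3, well above 100%, even though every word the target speaker said was recognized correctly.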
