A Cocktail-Party Benchmark: Multi-Modal dataset and Comparative Evaluation Results

Thai-Binh Nguyen; Katerina Zmolikova; Pingchuan Ma; Ngoc Quan Pham; Christian Fuegen; Alexander Waibel

arXiv:2510.23276·cs.CL·February 13, 2026

TL;DR

This paper introduces a new multi-modal dataset and benchmark for recognizing and separating overlapping multi-party conversations using audio, visual, and contextual cues, advancing research in natural multi-party speech recognition.

Contribution

It presents the MCoRec task, a novel benchmark with a multi-modal dataset for multi-party conversation recognition in challenging overlapping scenarios, including baseline systems and evaluation results.

Findings

01  Visual cues significantly improve recognition accuracy.

02  Audio-only systems struggle with high speech overlap.

03  Multi-modal approaches outperform audio-only baselines.

Abstract

We introduce the task of Multi-Modal Context-Aware Recognition (MCoRec) in the ninth CHiME Challenge, which addresses the cocktail-party problem of overlapping conversations in a single-room setting using audio, visual, and contextual cues. MCoRec captures natural multi-party conversations where the recordings focus on unscripted, casual group chats, leading to extreme speech overlap of up to 100% and highly fragmented conversational turns. The task requires systems to answer the question "Who speaks when, what, and with whom?" by jointly transcribing each speaker's speech and clustering them into their respective conversations from audio-visual recordings. Audio-only baselines exceed 100% word error rate, whereas incorporating visual cues yields substantial 50% improvements, highlighting the importance of multi-modality. In this manuscript, we present the motivation behind the task,…
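A word error rate above 100%, as reported for the audio-only baselines, can look surprising at first. Under the standard definition, WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words, so a hypothesis containing many inserted words, for instance words picked up from a neighbouring conversation, can score above 1.0. The sketch below is a minimal illustration, assuming whitespace tokenization and no text normalization; the function name `word_error_rate` and the example sentences are hypothetical, and this is not the challenge's official scoring tool.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are split on whitespace and compared exactly
    (no case folding or punctuation normalization).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words to nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the target speaker says three words, but the
# recognizer also transcribes five words from an overlapping talker.
reference = "see you later"
hypothesis = "see you later did you watch the game"
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.1%}")
```

In this made-up case all three reference words are recognized correctly, yet the five inserted words are each counted as errors against a reference of only three words, which pushes the score past 100%.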
