BUT System Description for CHiME-9 MCoRec Challenge

Dominik Klement; Alexander Polok; Nguyen Hai Phong; Prachi Singh; Lukáš Burget

arXiv:2604.27436·eess.AS·May 1, 2026

TL;DR

This paper introduces the BUT system for multi-talker audio-visual ASR (AV-ASR) in overlapping conversations. It combines a long-context target-speaker model with LLM-based clustering, reaching 33.7% WER on the MCoRec dev set and a clustering F1 score of 0.97.

Contribution

The work presents a novel AV-ASR architecture conditioned on visual cues and a clustering method using LLMs, advancing multi-talker transcription in complex scenarios.

Findings

1. Achieved 33.7% WER on MCoRec dev set, 16.2% better than baseline.

2. Attained a clustering F1 score of 0.97, surpassing previous methods.

3. Ranked second in the CHiME-9 MCoRec challenge.
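
For readers unfamiliar with the WER metric quoted above, the following is a minimal sketch of the standard word error rate: word-level edit distance divided by the number of reference words. It is not the challenge's scoring script. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration, and any text normalization the official scorer applies is omitted.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: (substitutions + deletions + insertions) / reference words.

    Hypothetical helper for illustration; tokenization is plain whitespace
    splitting, which is an assumption and not the challenge's normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 / 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.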

Abstract

Multi-talker automatic speech recognition (ASR) in conversational recordings remains an open problem, particularly in scenarios with a large portion of overlapping speech where identifying and transcribing a target speaker is difficult from audio alone. Visual cues can help resolve speaker ambiguity, yet their integration into long-context audio-visual (AV) ASR systems has been limited. The CHiME-9 MCoRec task addresses this challenge by requiring transcription of audio-visual recordings of heavily overlapped parallel conversations, followed by clustering the participants into conversational groups. In this work, we present the BUT system based on a long-context target-speaker AV-ASR model capable of processing long-form recordings in a single decoding pass. Our architecture conditions a pre-trained NVIDIA Parakeet-v2 ASR model on visual representations from a pre-trained AV-HuBERT model.…
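
The abstract describes clustering participants into conversational groups, and the findings report a clustering F1 score. The text here does not define that score, so the sketch below assumes one common choice, a pairwise F1 in which a pair of speakers counts as positive when both are assigned to the same conversation. The function name `pairwise_clustering_f1` and the speaker and group labels are hypothetical, and the official challenge metric may differ.

```python
from itertools import combinations


def pairwise_clustering_f1(true_groups: dict, pred_groups: dict) -> float:
    """Pairwise F1 between two speaker-to-conversation assignments.

    Assumption: a speaker pair is "positive" when both speakers share a
    conversation. This is an illustrative definition, not necessarily the
    one used by the challenge.
    """
    if set(true_groups) != set(pred_groups):
        raise ValueError("both assignments must cover the same speakers")

    tp = fp = fn = 0
    for a, b in combinations(sorted(true_groups), 2):
        same_true = true_groups[a] == true_groups[b]
        same_pred = pred_groups[a] == pred_groups[b]
        if same_true and same_pred:
            tp += 1
        elif same_pred:
            fp += 1  # predicted together, actually in different conversations
        elif same_true:
            fn += 1  # actually together, predicted apart
    if tp + fp + fn == 0:
        return 1.0  # no co-grouped pairs in either assignment: trivially equal
    return 2 * tp / (2 * tp + fp + fn)


if __name__ == "__main__":
    truth = {"spk1": "A", "spk2": "A", "spk3": "B", "spk4": "B"}
    guess = {"spk1": 0, "spk2": 0, "spk3": 0, "spk4": 1}
    print(pairwise_clustering_f1(truth, guess))
```

The score depends only on which speakers are grouped together, so the cluster label names in the two assignments do not need to match.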
