BUT System Description for CHiME-9 MCoRec Challenge
Dominik Klement, Alexander Polok, Nguyen Hai Phong, Prachi Singh, Lukáš Burget

TL;DR
This paper introduces the BUT system for multi-talker AV-ASR in overlapping conversations. It combines a long-context target-speaker model with LLM-based clustering, reaching 33.7% WER on the MCoRec dev set and a clustering F1 score of 0.97.
Contribution
The work presents a novel AV-ASR architecture conditioned on visual cues and a clustering method using LLMs, advancing multi-talker transcription in complex scenarios.
Findings
Achieved 33.7% WER on the MCoRec dev set, 16.2% better than the baseline.
Attained a clustering F1 score of 0.97, surpassing previous methods.
Ranked second in the CHiME-9 MCoRec challenge.
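For readers unfamiliar with the WER metric quoted above, the following is a minimal sketch of the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` is hypothetical, and the sketch assumes plain whitespace tokenization; the challenge's official scoring may apply text normalization and other rules not shown here, so this is not the challenge scorer.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly,
    with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "the cat sit on mat" against the reference "the cat sat on the mat" counts one substitution and one deletion over six reference words.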
Abstract
Multi-talker automatic speech recognition (ASR) in conversational recordings remains an open problem, particularly in scenarios with a large portion of overlapping speech, where identifying and transcribing a target speaker is difficult from audio alone. Visual cues can help resolve speaker ambiguity, yet their integration into long-context audio-visual (AV) ASR systems has been limited. The CHiME-9 MCoRec task addresses this challenge by requiring transcription of audio-visual recordings of heavily overlapped parallel conversations, followed by clustering the participants into conversational groups. In this work, we present the BUT system based on a long-context target-speaker AV-ASR model capable of processing long-form recordings in a single decoding pass. Our architecture conditions a pre-trained NVIDIA Parakeet-v2 ASR model on visual representations from a pre-trained AV-HuBERT model.…
