The USTC-NERCSLIP Systems for the CHiME-9 MCoRec Challenge
Ya Jiang, Ruoyu Wang, Jingxuan Zhang, Jun Du, Yi Han, Zihao Quan, Hang Chen, Yeran Yang, Kongzhi Zheng, Zhuo Chen, Yanhui Tu, Shutong Niu, Changfeng Xi, Mengzhi Wang, Zhongbin Wu, Jieru Chen, Henghui Zhi, Weiyi Shi, Shuhang Wu, Genshun Wan, Jia Pan, Jianqing Gao

TL;DR
This paper presents a multimodal system combining audio and visual data to improve speech recognition and speaker clustering in complex, overlapping multi-conversation indoor scenarios, achieving state-of-the-art results.
Contribution
It introduces a novel multimodal cascaded system utilizing synchronized video and audio with advanced pretrained models for improved speech recognition and clustering in multi-party conversations.
Findings
Achieved a Speaker WER of 32.44% on the development set.
Reduced WER to 31.40% with output fusion techniques.
Attained a speaker clustering F1 score of 1.0 with zero-shot LLM-based methods.
Abstract
This report details our submission to the CHiME-9 MCoRec Challenge on recognizing and clustering multiple concurrent natural conversations within indoor social settings. Unlike conventional meetings centered on a single shared topic, this scenario contains multiple parallel dialogues--up to eight speakers across up to four simultaneous conversations--with a speech overlap rate exceeding 90%. To tackle this, we propose a multimodal cascaded system that leverages per-speaker visual streams extracted from synchronized 360 degree video together with single-channel audio. Our system improves three components of the pipeline by leveraging enhanced audio-visual pretrained models: Active Speaker Detection (ASD), Audio-Visual Target Speech Extraction (AVTSE), and Audio-Visual Speech Recognition (AVSR). The AVSR module further incorporates Whisper and LLM techniques to boost transcription…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech and Audio Processing · Speech Recognition and Synthesis · Video Analysis and Summarization
