The USTC-NERCSLIP Systems for the CHiME-9 MCoRec Challenge

Ya Jiang; Ruoyu Wang; Jingxuan Zhang; Jun Du; Yi Han; Zihao Quan; Hang Chen; Yeran Yang; Kongzhi Zheng; Zhuo Chen; Yanhui Tu; Shutong Niu; Changfeng Xi; Mengzhi Wang; Zhongbin Wu; Jieru Chen; Henghui Zhi; Weiyi Shi; Shuhang Wu; Genshun Wan; Jia Pan; Jianqing Gao

arXiv:2603.01415·eess.AS·March 3, 2026

TL;DR

This paper presents a multimodal cascaded system that combines single-channel audio with per-speaker video to improve speech recognition and speaker clustering in indoor scenes with multiple heavily overlapping conversations, reporting a Speaker WER of 32.44% on the development set and a speaker clustering F1 score of 1.0.

Contribution

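It introduces a multimodal cascaded system that pairs per-speaker visual streams from synchronized 360-degree video with single-channel audio, using enhanced audio-visual pretrained models for Active Speaker Detection (ASD), Audio-Visual Target Speech Extraction (AVTSE), and Audio-Visual Speech Recognition (AVSR) to improve recognition and clustering of concurrent conversations.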

Findings

01. Achieved a Speaker WER of 32.44% on the development set.

02. Reduced WER to 31.40% with output fusion techniques.

03. Attained a speaker clustering F1 score of 1.0 with zero-shot LLM-based methods.
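
For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how a word error rate and a clustering F1 score can be computed. It is an illustration under stated assumptions, not the challenge's official scoring: `word_error_rate` is the standard word-level edit distance divided by the reference length, and `pairwise_f1` assumes the clustering score is an F1 over speaker pairs, which this page does not confirm. Both function names and the example data are hypothetical.

```python
from itertools import combinations


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def pairwise_f1(true_labels: dict, predicted_labels: dict) -> float:
    """F1 over speaker pairs (an assumed definition, not the official one).

    A pair counts as positive when both speakers are in the same conversation.
    """
    tp = fp = fn = 0
    for a, b in combinations(sorted(true_labels), 2):
        same_true = true_labels[a] == true_labels[b]
        same_pred = predicted_labels[a] == predicted_labels[b]
        tp += same_true and same_pred
        fp += same_pred and not same_true
        fn += same_true and not same_pred
    if tp == 0:
        # No correctly grouped pair: report 0.0 rather than divide by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # One substitution and one deletion against six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Cluster IDs need not match, only the grouping of speakers.
    truth = {"A": 0, "B": 0, "C": 1, "D": 1}
    print(pairwise_f1(truth, {"A": 5, "B": 5, "C": 7, "D": 7}))
```

Under this pairwise definition, a score of 1.0 means every pair of speakers was placed in the same or in different conversations exactly as in the reference grouping.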

Abstract

This report details our submission to the CHiME-9 MCoRec Challenge on recognizing and clustering multiple concurrent natural conversations within indoor social settings. Unlike conventional meetings centered on a single shared topic, this scenario contains multiple parallel dialogues (up to eight speakers across up to four simultaneous conversations) with a speech overlap rate exceeding 90%. To tackle this, we propose a multimodal cascaded system that combines per-speaker visual streams extracted from synchronized 360-degree video with single-channel audio. Our system improves three components of the pipeline by using enhanced audio-visual pretrained models: Active Speaker Detection (ASD), Audio-Visual Target Speech Extraction (AVTSE), and Audio-Visual Speech Recognition (AVSR). The AVSR module further incorporates Whisper and LLM techniques to boost transcription…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Video Analysis and Summarization