The Multimodal Information Based Speech Processing (MISP) 2025 Challenge: Audio-Visual Diarization and Recognition

Ming Gao; Shilong Wu; Hang Chen; Jun Du; Chin-Hui Lee; Shinji Watanabe; Jingdong Chen; Siniscalchi Sabato Marco; Odette Scharenborg

arXiv:2505.13971 · cs.SD · May 28, 2025

Open Access

TL;DR

The MISP 2025 Challenge, hosted at Interspeech 2025, targeted multi-modal, multi-device meeting transcription by adding video to audio. The top submissions improved substantially on the baselines in both diarization and recognition accuracy.

Contribution

This paper presents the objectives, datasets, baseline systems, and participant solutions for the first multi-modal meeting transcription challenge, highlighting new approaches and substantial performance gains.

Findings

01. The top AVSD system achieved a DER of 8.09%, improving on the baseline by 7.43%.

02. The top AVSR system achieved a CER of 9.48%, improving on the baseline by 10.62%.

03. The best AVDR system achieved a cpCER of 11.56%, improving on the baseline by 72.49%.
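
The summary does not say whether these gains are absolute percentage points or relative reductions. The sketch below is only an illustration of how the two readings differ: it computes the baseline that each reading would imply. The function names are hypothetical, and neither implied baseline is a figure reported on this page.

```python
def implied_baseline_absolute(final_pct: float, gain_points: float) -> float:
    """Baseline implied if the gain is an absolute drop in percentage points."""
    return final_pct + gain_points


def implied_baseline_relative(final_pct: float, gain_pct: float) -> float:
    """Baseline implied if the gain is a relative reduction of the baseline."""
    if not 0 <= gain_pct < 100:
        raise ValueError("relative gain must be in [0, 100)")
    return final_pct / (1 - gain_pct / 100)


# Figures quoted in the findings: (final error rate, stated improvement).
findings = {"DER": (8.09, 7.43), "CER": (9.48, 10.62), "cpCER": (11.56, 72.49)}

for metric, (final, gain) in findings.items():
    absolute = implied_baseline_absolute(final, gain)
    relative = implied_baseline_relative(final, gain)
    print(f"{metric}: absolute reading {absolute:.2f}%, relative reading {relative:.2f}%")
```

The two readings diverge most for cpCER, so the full paper is the place to check which one the authors intend.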

Abstract

Meetings are a valuable yet challenging scenario for speech applications due to complex acoustic conditions. This paper summarizes the outcomes of the MISP 2025 Challenge, hosted at Interspeech 2025, which focuses on multi-modal, multi-device meeting transcription by incorporating video modality alongside audio. The tasks include Audio-Visual Speaker Diarization (AVSD), Audio-Visual Speech Recognition (AVSR), and Audio-Visual Diarization and Recognition (AVDR). We present the challenge's objectives, tasks, dataset, baseline systems, and solutions proposed by participants. The best-performing systems achieved significant improvements over the baseline: the top AVSD model achieved a Diarization Error Rate (DER) of 8.09%, improving by 7.43%; the top AVSR system achieved a Character Error Rate (CER) of 9.48%, improving by 10.62%; and the best AVDR system achieved a concatenated…
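
For readers unfamiliar with the recognition metric, here is a minimal sketch of Character Error Rate as it is conventionally defined: the character-level edit distance between hypothesis and reference, divided by the reference length. This is a generic illustration, not the challenge's official scoring tool; the function names are hypothetical, and any text normalization the organizers apply is not modeled.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance counting substitutions, insertions, and deletions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,  # reference character missing from hypothesis
                    current[j - 1] + 1,  # extra character in hypothesis
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = character edit distance / number of reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(character_error_rate("abcd", "abxd"))
```

Because insertions count as errors, CER can exceed 100% when the hypothesis is much longer than the reference.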

Taxonomy

Topics: Speech Recognition and Synthesis