Overlap-Adaptive Hybrid Speaker Diarization and ASR-Aware Observation Addition for MISP 2025 Challenge

Shangkun Huang; Yuxuan Du; Jingwen Yang; Dejun Zhang; Xupeng Jia; Jing Deng; Jintao Kang; Rong Zheng

arXiv:2505.22013 · cs.SD · May 29, 2025

TL;DR

This paper introduces a hybrid speaker diarization system that adapts to the degree of overlapping speech, together with an ASR-aware observation addition method. The system took first place in Tracks 2 and 3 of the MISP 2025 Challenge, which targets real-world meeting scenarios.

Contribution

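The paper presents a hybrid diarization approach that selects between a WavLM end-to-end segmentation model and a traditional multi-module clustering pipeline according to the degree of speech overlap, and an ASR-aware observation addition technique that compensates for the limitations of Guided Source Separation (GSS) at low signal-to-noise ratios.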
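As a rough illustration of the general idea behind observation addition, the sketch below interpolates a separated (enhanced) signal with the unprocessed observation so that some of the original signal is restored where separation has introduced artifacts. This is the generic technique only, not the paper's ASR-aware variant: the function name `observation_addition`, the fixed `weight` parameter, and the use of plain Python lists for samples are all assumptions made for this example. How the authors choose the mixing weight is not described in the text here.

```python
def observation_addition(enhanced, observed, weight):
    """Interpolate an enhanced signal with the raw observation.

    Hypothetical helper for illustration only.
    weight = 0.0 returns the enhanced signal unchanged;
    weight = 1.0 returns the unprocessed observation.
    """
    if len(enhanced) != len(observed):
        raise ValueError("signals must have the same number of samples")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    # Sample-wise convex combination of the two signals.
    return [(1.0 - weight) * e + weight * o for e, o in zip(enhanced, observed)]


# Toy three-sample signals (made-up values, not real audio).
enhanced = [0.0, 0.5, -0.5]
observed = [0.2, 0.3, -0.1]
mixed = observation_addition(enhanced, observed, weight=0.5)
```

A fixed weight is the simplest possible choice; an ASR-aware scheme would presumably pick the weight using feedback from the recognizer, but that mechanism is outside what this sketch covers.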

Findings

01 Achieved 9.48% CER on Track 2

02 Secured 11.56% cpCER on Track 3

03 Won first place in both challenge tracks

Abstract

This paper presents the system developed to address the MISP 2025 Challenge. For the diarization system, we proposed a hybrid approach that combines a WavLM end-to-end segmentation method with a traditional multi-module clustering technique and adaptively selects the appropriate model for varying degrees of overlapping speech. For the automatic speech recognition (ASR) system, we proposed an ASR-aware observation addition method that compensates for the performance limitations of Guided Source Separation (GSS) under low signal-to-noise ratio conditions. Finally, we integrated the speaker diarization and ASR systems in a cascaded architecture to address Track 3. Our system achieved a character error rate (CER) of 9.48% on Track 2 and a concatenated minimum-permutation character error rate (cpCER) of 11.56% on Track 3, ultimately securing first place in both tracks and thereby…
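To make the two metrics concrete, here is a minimal sketch of how CER and cpCER are commonly computed. CER is the character-level edit distance divided by the reference length. cpCER concatenates each speaker's utterances, then takes the lowest total error over all ways of pairing hypothesis speakers with reference speakers. The function names and the dictionary-based input layout are assumptions for this example, and the challenge's official scoring tool may differ in details such as text normalization.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate of one hypothesis against one reference."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def cp_cer(ref_by_speaker, hyp_by_speaker):
    """Concatenated minimum-permutation CER (brute force over speaker pairings)."""
    refs = ["".join(utts) for utts in ref_by_speaker.values()]
    hyps = ["".join(utts) for utts in hyp_by_speaker.values()]
    # Pad with empty speakers so both sides have the same speaker count.
    n = max(len(refs), len(hyps))
    refs += [""] * (n - len(refs))
    hyps += [""] * (n - len(hyps))
    total_ref_chars = sum(len(r) for r in refs)
    if total_ref_chars == 0:
        raise ValueError("reference must contain at least one character")
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_chars


# Toy example: ASCII letters stand in for characters, and the
# hypothesis speaker labels are arbitrary and in a different order.
reference = {"A": ["ab", "cd"], "B": ["xyz"]}
hypothesis = {"spk1": ["xyw"], "spk2": ["abcd"]}
score = cp_cer(reference, hypothesis)
```

In the toy example the best pairing matches `spk2` to `A` and `spk1` to `B`, leaving one substitution over seven reference characters. The brute-force search over permutations is only practical for a handful of speakers; larger sessions would need an assignment solver instead.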

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research