Joint Learning Global-Local Speaker Classification to Enhance End-to-End Speaker Diarization and Recognition
Yuhang Dai, Haopeng Lin, Jiale Qian, Ruiqi Yan, Hao Meng, Hanke Xie, Hanlin Wen, Shunshun Yin, Ming Tao, Xie Chen, Lei Xie, Xinsheng Wang

TL;DR
This paper introduces GLSC-SDR, a joint training paradigm that enhances speaker discrimination in end-to-end diarization and recognition by using hierarchical global-local speaker classification.
Contribution
It proposes a novel hierarchical global-local speaker classification strategy integrated with large audio-language models to improve speaker discriminability.
Findings
Achieves competitive or superior performance on the AliMeeting, AISHELL-4, and AMI-SDM datasets.
Enhances fine-grained speaker discrimination without large-scale real data.
Maintains semantic transcription accuracy while improving speaker recognition.
Abstract
Large Audio-Language Models (LALMs) have demonstrated remarkable performance in end-to-end speaker diarization and recognition. However, their speaker discriminability remains limited due to the scarcity of large-scale conversational data and the absence of explicit speaker representation optimization. To address this, we propose GLSC-SDR, a paradigm that jointly trains speaker classification with diarization and recognition. We further introduce a Global-Local Speaker Classification strategy, which uses clustered speakers as global labels and re-encoded intra-cluster speakers as local labels. This hierarchical design enhances fine-grained speaker discrimination while preserving semantic transcription accuracy. Experiments on AliMeeting, AISHELL-4, and AMI-SDM demonstrate that GLSC-SDR achieves competitive or superior performance compared to simulation-based and multi-encoder…
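The global-local labeling idea in the abstract can be illustrated with a minimal sketch. This is one possible reading, not the authors' implementation: it assumes a speaker-to-cluster assignment already exists (the clustering method is not given here), treats the cluster ID as the global label, and re-indexes speakers inside each cluster from zero to obtain the local label. The function name `global_local_labels` and the toy speaker IDs are hypothetical.

```python
from collections import defaultdict


def global_local_labels(speaker_to_cluster):
    """Map each speaker ID to a (global_label, local_label) pair.

    Assumption: the global label is the speaker's cluster ID, and the
    local label is the speaker's index re-encoded within that cluster.
    """
    # Group speakers by cluster; sorting makes the local indices deterministic.
    members = defaultdict(list)
    for speaker in sorted(speaker_to_cluster):
        members[speaker_to_cluster[speaker]].append(speaker)

    # Re-encode each speaker as its position inside its own cluster.
    labels = {}
    for cluster, speakers in members.items():
        for local_idx, speaker in enumerate(speakers):
            labels[speaker] = (cluster, local_idx)
    return labels


# Hypothetical toy assignment: five speakers grouped into two clusters.
assignment = {"spk_a": 0, "spk_b": 1, "spk_c": 0, "spk_d": 1, "spk_e": 0}
labels = global_local_labels(assignment)

# Size of each classification target space under this scheme.
num_global = len(set(assignment.values()))
num_local = max(local for _, local in labels.values()) + 1
```

Under this reading, the local classifier only needs as many classes as the largest cluster has speakers, rather than one class per speaker in the whole training set, while the pair of labels still identifies each speaker uniquely.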
