Joint Learning Global-Local Speaker Classification to Enhance End-to-End Speaker Diarization and Recognition

Yuhang Dai; Haopeng Lin; Jiale Qian; Ruiqi Yan; Hao Meng; Hanke Xie; Hanlin Wen; Shunshun Yin; Ming Tao; Xie Chen; Lei Xie; Xinsheng Wang

arXiv:2603.25377·cs.SD·March 30, 2026

TL;DR

This paper introduces GLSC-SDR, a joint training paradigm that enhances speaker discrimination in end-to-end diarization and recognition by using hierarchical global-local speaker classification.

Contribution

The paper proposes a hierarchical global-local speaker classification strategy, trained jointly with diarization and recognition in a large audio-language model, to improve speaker discriminability.

Findings

01. Achieves competitive or superior performance on AliMeeting, AISHELL-4, and AMI-SDM.

02. Enhances fine-grained speaker discrimination without large-scale real data.

03. Preserves semantic transcription accuracy while improving speaker recognition.

Abstract

Large Audio-Language Models (LALMs) have demonstrated remarkable performance in end-to-end speaker diarization and recognition. However, their speaker discriminability remains limited due to the scarcity of large-scale conversational data and the absence of explicit speaker representation optimization. To address this, we propose GLSC-SDR, a paradigm that jointly trains speaker classification with diarization and recognition. We further introduce a Global-Local Speaker Classification strategy, which uses clustered speakers as global labels and re-encoded intra-cluster speakers as local labels. This hierarchical design enhances fine-grained speaker discrimination while preserving semantic transcription accuracy. Experiments on AliMeeting, AISHELL-4, and AMI-SDM demonstrate that GLSC-SDR achieves competitive or superior performance compared to simulation-based and multi-encoder…
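To make the global-local labeling idea concrete, here is a minimal sketch of one way such hierarchical labels could be derived. It is not the authors' implementation. It assumes the speakers have already been clustered by some upstream step, and it assumes "re-encoded" means re-indexing speakers from zero inside each cluster. The function name `global_local_labels` and the speaker IDs are hypothetical.

```python
from collections import defaultdict


def global_local_labels(speaker_to_cluster):
    """Map each speaker to a (global_label, local_label) pair.

    global_label: the cluster the speaker was assigned to (assumed given).
    local_label:  the speaker's index re-encoded within that cluster,
                  counting from 0 (an assumption about "re-encoded").
    """
    members = defaultdict(list)
    # Sort speaker IDs so the local indices are deterministic.
    for speaker in sorted(speaker_to_cluster):
        members[speaker_to_cluster[speaker]].append(speaker)

    labels = {}
    for cluster_id, speakers in members.items():
        for local_id, speaker in enumerate(speakers):
            labels[speaker] = (cluster_id, local_id)
    return labels


# Hypothetical cluster assignments for five training speakers.
assignments = {"spk_a": 0, "spk_b": 0, "spk_c": 1, "spk_d": 0, "spk_e": 1}
labels = global_local_labels(assignments)

# Output sizes a pair of classification heads would need under this scheme.
num_global_classes = len(set(assignments.values()))
num_local_classes = max(local for _, local in labels.values()) + 1
```

Under this reading, the global head only needs one class per cluster and the local head only needs as many classes as the largest cluster, rather than one class per training speaker. Whether the paper sizes its classifiers this way is not stated in the abstract.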
