ASR-Synchronized Speaker-Role Diarization

Arindam Ghosh; Mark Fuhs; Bongjun Kim; Anurag Chowdhury; Monika Woszczyna

arXiv:2507.17765 · eess.AS · December 23, 2025

TL;DR

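This paper adapts a joint ASR and speaker diarization (ASR+SD) framework to speaker-role diarization (RD): the ASR transducer is frozen, and an auxiliary RD transducer trained in parallel assigns a role to each ASR-predicted word. Task-specific models and features reduce the role-based word diarization error.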

Contribution

The paper proposes an ASR+RD approach that uses task-specific predictor networks, higher-layer ASR encoder features, and a different loss function to improve role diarization performance.

Findings

01

Achieved 6.2% and 4.5% reductions in role-based word diarization error rate (R-WDER) on two datasets.

02

Outperformed existing baseline methods.

03

Demonstrated the importance of task-specific modeling for RD.
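
The R-WDER figures above are word-level error rates. This summary does not give the paper's exact definition, so the following is only a minimal sketch under a simplifying assumption: every ASR word has already been aligned to a reference word, and the score is the fraction of aligned words whose predicted role differs from the reference role. The function name `role_wder` and the role labels are hypothetical.

```python
from typing import List, Tuple

# One entry per aligned word: (reference_role, hypothesis_role).
# Assumption: inserted and deleted words were dropped by the aligner,
# so only matched (correct or substituted) words are scored.
AlignedWord = Tuple[str, str]


def role_wder(aligned: List[AlignedWord]) -> float:
    """Fraction of aligned words that carry the wrong role label."""
    if not aligned:
        raise ValueError("no aligned words to score")
    wrong = sum(1 for ref_role, hyp_role in aligned if ref_role != hyp_role)
    return wrong / len(aligned)


# Illustrative data only, not taken from the paper.
example = [
    ("doctor", "doctor"),
    ("doctor", "doctor"),
    ("patient", "doctor"),   # role error
    ("patient", "patient"),
]
print(role_wder(example))  # 0.25
```

Because the score is computed per word rather than per time segment, it depends on both the ASR output and the role assignment.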

Abstract

Speaker-role diarization (RD), such as doctor vs. patient or lawyer vs. client, is often more useful in practice than conventional speaker diarization (SD), which assigns only generic labels (speaker-1, speaker-2). The state-of-the-art end-to-end ASR+RD approach uses a single transducer that serializes word and role predictions (role at the end of a speaker's turn), but at the cost of degraded ASR performance. To address this, we adapt a recent joint ASR+SD framework to ASR+RD by freezing the ASR transducer and training an auxiliary RD transducer in parallel to assign a role to each ASR-predicted word. For this, we first show that SD and RD are fundamentally different tasks, exhibiting different dependencies on acoustic and linguistic information. Motivated by this, we propose (1) task-specific predictor networks and (2) using higher-layer ASR encoder features as input to the RD…
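
To make the serialized baseline format concrete, here is a minimal sketch of how a single output stream with a role token at the end of each turn could be converted into the per-word role labels that the parallel approach produces directly. The token names (`<doctor>`, `<patient>`), the `"unknown"` fallback, and the function name are assumptions for illustration; the abstract does not specify the actual token inventory.

```python
from typing import List, Tuple

# Hypothetical role tokens; the real system's vocabulary is not given here.
ROLE_TOKENS = {"<doctor>": "doctor", "<patient>": "patient"}


def roles_from_serialized(tokens: List[str]) -> List[Tuple[str, str]]:
    """Attach each end-of-turn role token to the words that precede it."""
    labeled: List[Tuple[str, str]] = []
    pending: List[str] = []  # words of the current turn, role not yet known
    for tok in tokens:
        if tok in ROLE_TOKENS:
            role = ROLE_TOKENS[tok]
            labeled.extend((word, role) for word in pending)
            pending = []
        else:
            pending.append(tok)
    # Words after the last role token never received a role.
    labeled.extend((word, "unknown") for word in pending)
    return labeled


stream = ["how", "are", "you", "<doctor>", "fine", "thanks", "<patient>"]
print(roles_from_serialized(stream))
# [('how', 'doctor'), ('are', 'doctor'), ('you', 'doctor'),
#  ('fine', 'patient'), ('thanks', 'patient')]
```

In this serialized format, role tokens share the output sequence with the words, whereas the parallel design described above keeps the ASR transducer frozen and predicts roles separately.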

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques