Joint ASR and Speaker Role Tagging with Serialized Output Training

Anfeng Xu; Tiantian Feng; Shrikanth Narayanan

arXiv:2506.10349 · eess.AS · June 13, 2025

Open Access

TL;DR

This paper fine-tunes Whisper with serialized output training (SOT) so that a single model performs speech recognition and speaker role tagging in one decoding pass, with a reported reduction of more than 10% in multi-talker WER against a previous baseline.

Contribution

The paper proposes augmenting Whisper with role-specific tokens and fine-tuning it with serialized output training for joint ASR and speaker role tagging.
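
To make the idea concrete, here is a minimal sketch of how a serialized training target with role tokens might be built. The token strings `<|agent|>` and `<|customer|>`, the `Utterance` type, and the `build_sot_target` helper are illustrative assumptions, not names taken from the paper. Ordering utterances by start time follows the common first-in, first-out form of SOT and may differ from the authors' exact setup.

```python
from dataclasses import dataclass

# Hypothetical role tokens for illustration; the paper's actual token
# strings and role inventory are not specified on this page.
ROLE_TOKENS = {"agent": "<|agent|>", "customer": "<|customer|>"}


@dataclass
class Utterance:
    role: str     # key into ROLE_TOKENS
    start: float  # utterance start time in seconds
    text: str     # reference transcript for this utterance


def build_sot_target(utterances):
    """Serialize utterances into a single target string.

    Utterances are sorted by start time (assumed first-in, first-out order),
    and each one is prefixed with its role token, so the role token also
    marks the boundary between consecutive utterances.
    """
    ordered = sorted(utterances, key=lambda u: u.start)
    return " ".join(f"{ROLE_TOKENS[u.role]} {u.text.strip()}" for u in ordered)
```

A model fine-tuned on such targets would be trained to emit the role token before each utterance, which is what allows transcription and role tagging to come out of the same decoding pass.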

Findings

01. Over 10% reduction in multi-talker WER

02. Effective role-aware transcriptions in a single decoding pass

03. Outperforms previous self-supervised baseline

Abstract

Automatic speech recognition (ASR) systems have made significant progress with large-scale pre-trained models. However, most current systems focus solely on transcribing speech without identifying speaker roles, a function that is critical for conversational AI. In this work, we investigate the use of serialized output training (SOT) for joint ASR and speaker role tagging. By augmenting Whisper with role-specific tokens and fine-tuning it with SOT, we enable the model to generate role-aware transcriptions in a single decoding pass. We compare the SOT approach against a previous self-supervised baseline method on two real-world conversational datasets. Our findings show that this approach achieves more than 10% reduction in multi-talker WER, demonstrating its feasibility as a unified model for speaker-role-aware speech transcription.
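
Role-aware output produced in a single decoding pass still has to be split back into per-role text before downstream use or per-role scoring. The following is a minimal sketch of that post-processing step. It assumes hypothetical role tokens `<|agent|>` and `<|customer|>` and helper names `split_by_role` and `transcript_per_role`, none of which come from the paper.

```python
import re

# Hypothetical role tokens for illustration; not taken from the paper.
ROLE_TOKENS = ("<|agent|>", "<|customer|>")

# Capturing group so that re.split keeps the role tokens in its output.
_TOKEN_RE = re.compile("(" + "|".join(re.escape(t) for t in ROLE_TOKENS) + ")")


def split_by_role(decoded):
    """Split a decoded string into (role_token, text) segments in order.

    Text that appears before the first role token has no role and is dropped.
    """
    segments = []
    role = None
    for part in _TOKEN_RE.split(decoded):
        if part in ROLE_TOKENS:
            role = part
        elif part.strip() and role is not None:
            segments.append((role, part.strip()))
    return segments


def transcript_per_role(decoded):
    """Concatenate each role's segments into one transcript per role."""
    per_role = {}
    for role, text in split_by_role(decoded):
        per_role.setdefault(role, []).append(text)
    return {role: " ".join(texts) for role, texts in per_role.items()}
```

Grouping text by role in this way is one plausible starting point for computing an error rate per role. The paper's own multi-talker WER definition may be computed differently.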

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems

Methods: Focus · Attentive Walk-Aggregating Graph Neural Network