End-to-End Joint ASR and Speaker Role Diarization with Child-Adult Interactions
Anfeng Xu, Tiantian Feng, Somer Bishop, Catherine Lord, Shrikanth Narayanan

TL;DR
This paper introduces a unified end-to-end model that jointly performs automatic speech recognition and speaker role diarization for child-adult interactions, improving accuracy and efficiency over traditional cascaded systems.
Contribution
It extends the Whisper architecture with novel joint modeling techniques, including serialized output training, diarization-guided silence suppression, and state-machine-based forced decoding, to enable reliable speaker-attributed transcription.
Findings
Achieves lower multi-talker word error rates
Demonstrates competitive diarization accuracy
Outperforms cascaded baseline systems
Abstract
Accurate transcription and speaker diarization of child-adult spoken interactions are crucial for developmental and clinical research. However, manual annotation is time-consuming and challenging to scale. Existing automated systems typically rely on cascaded speaker diarization and speech recognition pipelines, which can lead to error propagation. This paper presents a unified end-to-end framework that extends the Whisper encoder-decoder architecture to jointly model ASR and child-adult speaker role diarization. The proposed approach integrates: (i) a serialized output training scheme that emits speaker tags and start/end timestamps, (ii) a lightweight frame-level diarization head that enhances speaker-discriminative encoder representations, (iii) diarization-guided silence suppression for improved temporal precision, and (iv) a state-machine-based forced decoding procedure that…
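As an illustration of items (i) and (iv), the sketch below serializes speaker-attributed segments into a single token stream and checks that a stream follows a fixed tag, start time, words, end time order. It is a minimal sketch under stated assumptions, not the paper's implementation: the tag names `<|child|>` and `<|adult|>`, the timestamp format, the token order, and every function name are hypothetical. An actual forced decoder would presumably apply such constraints to the model's output scores at each decoding step rather than validate a finished sequence.

```python
import re
from typing import List, Tuple

# Hypothetical token inventory; the paper's actual tags and format may differ.
SPEAKER_TAGS = {"<|child|>", "<|adult|>"}
TIME_RE = re.compile(r"^<\|\d+\.\d{2}\|>$")

Segment = Tuple[str, float, float, str]  # (role, start_sec, end_sec, text)


def serialize(segments: List[Segment]) -> List[str]:
    """Flatten segments, ordered by start time, into one token sequence."""
    tokens: List[str] = []
    for role, start, end, text in sorted(segments, key=lambda s: s[1]):
        tokens.append(f"<|{role}|>")        # speaker role tag
        tokens.append(f"<|{start:.2f}|>")   # segment start timestamp
        tokens.extend(text.split())         # whitespace-split words
        tokens.append(f"<|{end:.2f}|>")     # segment end timestamp
    return tokens


def kind(token: str) -> str:
    """Classify a token as a speaker tag, a timestamp, or a plain word."""
    if token in SPEAKER_TAGS:
        return "speaker"
    if TIME_RE.match(token):
        return "time"
    return "word"


# Assumed state machine: (current state, token kind) -> next state.
# Any pair missing from this table is a forbidden transition.
TRANSITIONS = {
    ("SPEAKER", "speaker"): "START",
    ("START", "time"): "TEXT",
    ("TEXT", "word"): "TEXT_OR_END",        # at least one word per segment
    ("TEXT_OR_END", "word"): "TEXT_OR_END",
    ("TEXT_OR_END", "time"): "SPEAKER",
}


def is_well_formed(tokens: List[str]) -> bool:
    """Return True if the sequence consists only of complete segments."""
    state = "SPEAKER"
    for tok in tokens:
        nxt = TRANSITIONS.get((state, kind(tok)))
        if nxt is None:
            return False
        state = nxt
    return state == "SPEAKER"


example = serialize([
    ("child", 1.5, 2.0, "a dog"),
    ("adult", 0.0, 1.2, "what is this"),
])
```

Here `example` begins with the adult segment because segments are sorted by start time, and `is_well_formed(example)` accepts it, while a stream that starts with a timestamp or stops before the closing timestamp is rejected.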
Taxonomy
Topics: Speech Recognition and Synthesis · Language Development and Disorders · Voice and Speech Disorders
