End-to-End Joint ASR and Speaker Role Diarization with Child-Adult Interactions

Anfeng Xu; Tiantian Feng; Somer Bishop; Catherine Lord; Shrikanth Narayanan

arXiv:2601.17640 · eess.AS · January 27, 2026

TL;DR

This paper introduces a unified end-to-end model that jointly performs automatic speech recognition and speaker role diarization for child-adult interactions, improving accuracy and efficiency over traditional cascaded systems.

Contribution

The paper extends the Whisper encoder-decoder architecture with joint modeling techniques, including serialized output training, diarization-guided silence suppression, and state-machine-based forced decoding, enabling reliable speaker-attributed transcription.
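
As a rough illustration of what a serialized output target could look like, the sketch below flattens speaker-attributed segments into one token string with a speaker tag followed by start and end timestamps. The `<|0.00|>` timestamp style and 0.02 s resolution follow Whisper's convention; the `<|child|>` and `<|adult|>` tag names, the `Segment` class, and the `serialize` helper are assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    role: str     # hypothetical role label: "child" or "adult"
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds
    text: str     # transcript of the segment


def ts(t: float, step: float = 0.02) -> str:
    """Quantize a time to the given step and format it as a timestamp token."""
    quantized = round(t / step) * step
    return f"<|{quantized:.2f}|>"


def serialize(segments: List[Segment]) -> str:
    """Order segments by start time and emit: tag, start, text, end."""
    ordered = sorted(segments, key=lambda s: s.start)
    return "".join(
        f"<|{s.role}|>{ts(s.start)} {s.text.strip()} {ts(s.end)}"
        for s in ordered
    )


example = [
    Segment("adult", 1.5, 2.8, "hi there"),
    Segment("child", 0.0, 1.2, "hello"),
]
target = serialize(example)
```

Sorting by start time is one common choice for serialized output training; the ordering actually used by the authors is not stated on this page.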

Findings

01. Achieves lower multi-talker word error rates

02. Demonstrates competitive diarization accuracy

03. Outperforms cascaded baseline systems

Abstract

Accurate transcription and speaker diarization of child-adult spoken interactions are crucial for developmental and clinical research. However, manual annotation is time-consuming and challenging to scale. Existing automated systems typically rely on cascaded speaker diarization and speech recognition pipelines, which can lead to error propagation. This paper presents a unified end-to-end framework that extends the Whisper encoder-decoder architecture to jointly model ASR and child-adult speaker role diarization. The proposed approach integrates: (i) a serialized output training scheme that emits speaker tags and start/end timestamps, (ii) a lightweight frame-level diarization head that enhances speaker-discriminative encoder representations, (iii) diarization-guided silence suppression for improved temporal precision, and (iv) a state-machine-based forced decoding procedure that…
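
The abstract is cut off before it describes the state-machine-based forced decoding, so the following is only a guess at the general idea: a small state machine restricts which kind of token may be emitted next, so that every segment follows the order speaker tag, start timestamp, text, end timestamp. The states, the `token_kind` classifier, the tag names, and the greedy loop are all assumptions made for illustration, not the authors' procedure.

```python
from enum import Enum, auto
from typing import Dict, List


class State(Enum):
    EXPECT_SPEAKER = auto()  # a segment must open with a speaker tag (or end)
    EXPECT_START = auto()    # the speaker tag must be followed by a start time
    IN_TEXT = auto()         # words, closed by an end timestamp
    DONE = auto()


# Token kinds allowed in each state (assumed grammar).
ALLOWED = {
    State.EXPECT_SPEAKER: {"speaker", "eos"},
    State.EXPECT_START: {"timestamp"},
    State.IN_TEXT: {"text", "timestamp"},
    State.DONE: set(),
}


def token_kind(tok: str) -> str:
    """Classify a token string; the tag names are hypothetical."""
    if tok in ("<|child|>", "<|adult|>"):
        return "speaker"
    if tok == "<|endoftext|>":
        return "eos"
    if tok.startswith("<|") and tok.endswith("|>"):
        return "timestamp"
    return "text"


def next_state(state: State, kind: str) -> State:
    if kind == "eos":
        return State.DONE
    if state is State.EXPECT_SPEAKER:
        return State.EXPECT_START
    if state is State.EXPECT_START:
        return State.IN_TEXT
    # IN_TEXT: text keeps the segment open, a timestamp closes it.
    return State.IN_TEXT if kind == "text" else State.EXPECT_SPEAKER


def constrained_greedy(step_scores: List[Dict[str, float]]) -> List[str]:
    """At each step pick the best-scoring token whose kind the state allows."""
    state, output = State.EXPECT_SPEAKER, []
    for scores in step_scores:
        if state is State.DONE:
            break
        legal = {t: s for t, s in scores.items()
                 if token_kind(t) in ALLOWED[state]}
        if not legal:
            raise ValueError(f"no legal token in state {state.name}")
        best = max(legal, key=legal.get)
        output.append(best)
        state = next_state(state, token_kind(best))
    return output


steps = [
    {"hello": 0.9, "<|child|>": 0.1},   # text is illegal here, tag is forced
    {"<|0.00|>": 0.2, "hi": 0.8},       # start timestamp is forced
    {"hi": 0.7, "<|1.00|>": 0.3},
    {"<|1.00|>": 0.6, "there": 0.4},
    {"<|endoftext|>": 0.5, "x": 0.5},
]
decoded = constrained_greedy(steps)
```

In a real decoder the same constraint would typically be applied by masking disallowed token logits before sampling or beam search, rather than by filtering a dictionary of scores.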

Code & Models

Models

🤗 AlexXu811/child-adult-joint-asr-diarization
model · 49 downloads · ♡ 2

Taxonomy

Topics: Speech Recognition and Synthesis · Language Development and Disorders · Voice and Speech Disorders