TL;DR
Speaker-Reasoner is an end-to-end Speech LLM that improves multi-speaker conversation transcription through iterative analysis, temporal reasoning, and a speaker-aware cache, outperforming strong baselines on AliMeeting and AISHELL-4.
Contribution
Speaker-Reasoner introduces a multi-turn temporal reasoning approach for multi-speaker ASR that targets overlapping speech and rapid turn-taking, plus a speaker-aware cache that extends processing to audio longer than the training context window.
Findings
Achieves consistent improvements on AliMeeting and AISHELL-4 datasets.
Effectively handles overlapping speech and rapid turn-taking.
Outperforms strong baselines in multi-speaker transcription accuracy.
Abstract
Transcribing and understanding multi-speaker conversations requires speech recognition, speaker attribution, and timestamp localization. While speech LLMs excel at single-speaker tasks, multi-speaker scenarios remain challenging due to overlapping speech, backchannels, rapid turn-taking, and context window constraints. We propose Speaker-Reasoner, an end-to-end Speech LLM with agentic multi-turn temporal reasoning. Instead of single-pass inference, the model iteratively analyzes global audio structure, autonomously predicts temporal boundaries, and performs fine-grained segment analysis, jointly modeling speaker identity, gender, timestamps, and transcription. A speaker-aware cache further extends processing to audio exceeding the training context window. Trained with a three-stage progressive strategy, Speaker-Reasoner achieves consistent improvements over strong baselines on…
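The multi-turn flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Segment`, `SpeakerCache`, `transcribe`, `fake_boundaries`, and `fake_analyze` are hypothetical, the two model calls are replaced by stub functions, and the cache is assumed to be a simple mapping from a speaker signature to a global speaker label, which the abstract does not specify.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Segment:
    start: float
    end: float
    speaker: str
    gender: str
    text: str


@dataclass
class SpeakerCache:
    """Hypothetical cache: keeps speaker labels consistent across chunks."""
    profiles: Dict[str, str] = field(default_factory=dict)

    def resolve(self, signature: str) -> str:
        # Reuse the global label if this speaker was seen in an earlier chunk.
        if signature not in self.profiles:
            self.profiles[signature] = f"spk{len(self.profiles) + 1}"
        return self.profiles[signature]


def transcribe(
    duration: float,
    window: float,
    propose_boundaries: Callable[[float, float], List[Tuple[float, float]]],
    analyze_segment: Callable[[float, float], Tuple[str, str, str]],
) -> List[Segment]:
    """Process audio chunk by chunk, carrying speaker identities in a cache."""
    cache = SpeakerCache()
    results: List[Segment] = []
    offset = 0.0
    while offset < duration:
        chunk_end = min(offset + window, duration)
        # First turn: a global pass over the chunk predicts temporal boundaries.
        for start, end in propose_boundaries(offset, chunk_end):
            # Later turns: fine-grained analysis of each predicted segment.
            signature, gender, text = analyze_segment(start, end)
            results.append(
                Segment(start, end, cache.resolve(signature), gender, text)
            )
        offset = chunk_end
    return results


# Stub "model" calls (assumptions, for illustration only).
def fake_boundaries(lo: float, hi: float) -> List[Tuple[float, float]]:
    mid = (lo + hi) / 2
    return [(lo, mid), (mid, hi)]


def fake_analyze(start: float, end: float) -> Tuple[str, str, str]:
    signature = "voice_a" if int(start // 5) % 2 == 0 else "voice_b"
    return signature, "unknown", f"text {start:.0f}-{end:.0f}"


segments = transcribe(20.0, 10.0, fake_boundaries, fake_analyze)
for seg in segments:
    print(f"[{seg.start:.0f}-{seg.end:.0f}] {seg.speaker}: {seg.text}")
```

In this sketch the 20-second input is longer than the 10-second window, so it is handled in two chunks, and the cache is what lets the second chunk reuse the speaker labels assigned in the first.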
