TellWhisper: Tell Whisper Who Speaks When

Yifan Hu; Peiji Yang; Zhisheng Wang; Yicheng Zhong; Rui Liu

arXiv:2601.03712·eess.AS·April 15, 2026

TL;DR

TellWhisper introduces a unified speech encoder that jointly models speaker identity and timing, improving multi-speaker speech recognition especially during rapid turn-taking and overlaps.

Contribution

It proposes TS-RoPE for explicit time-speaker encoding and Hyper-SD for hyperbolic speaker activity estimation, addressing limitations of previous decoupled methods.
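To illustrate why hyperbolic geometry can help separate classes, here is a minimal sketch of distance in the Poincaré ball. This summary does not say which hyperbolic model Hyper-SD uses or how distances feed into its predictions, so the choice of the Poincaré ball and the function name `poincare_distance` are assumptions made for illustration only.

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between two points inside the unit Poincare ball.

    Illustrative only: the hyperbolic model used by Hyper-SD is not
    specified in this summary, so the Poincare ball is an assumption.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    sq_dist = np.sum((u - v) ** 2)
    # The denominator shrinks as either point approaches the boundary,
    # so a fixed Euclidean gap becomes a larger hyperbolic distance there.
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return float(np.arccosh(1.0 + 2.0 * sq_dist / max(denom, eps)))

# Same Euclidean gap (0.1), measured near the origin and near the boundary.
near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_boundary = poincare_distance([0.8, 0.0], [0.9, 0.0])
```

In this sketch `near_boundary` is larger than `near_origin`, which is the property that lets embeddings pushed toward the boundary sit far apart from one another.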

Findings

01. Significantly improves recognition accuracy in overlapping speech scenarios.

02. Effectively captures speaker turn transitions and state dynamics.

03. Enhances inter-class separation in speaker activity estimation.

Abstract

Multi-speaker automatic speech recognition (MASR) aims to predict "who spoke when and what" from multi-speaker speech, a key technology for multi-party dialogue understanding. However, most existing approaches decouple temporal modeling from speaker modeling when addressing "when" and "who": some inject speaker cues before encoding (e.g., speaker masking), which can cause irreversible information loss; others fuse identity by mixing speaker posteriors after encoding, which may entangle acoustic content with speaker identity. This separation is brittle under rapid turn-taking and overlapping speech, often leading to degraded performance. To address these limitations, we propose TellWhisper, a unified framework that jointly models speaker identity and temporal information within the speech encoder. Specifically, we design TS-RoPE, a time-speaker rotary positional encoding: time coordinates are…
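For readers unfamiliar with rotary positional encoding, the following is a minimal sketch of the standard one-dimensional RoPE that the name TS-RoPE refers to. It is not the paper's method: the abstract is truncated before it explains how time and speaker coordinates are derived or combined, so the function name `rope_rotate`, the `base` value, and the use of a single scalar position per frame are all assumptions.

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Apply standard rotary positional encoding.

    x:         float array of shape (seq_len, dim), dim must be even
    positions: array of shape (seq_len,), one scalar coordinate per row

    Generic RoPE only; how TS-RoPE builds its time and speaker
    coordinates is not described in this excerpt.
    """
    seq_len, dim = x.shape
    assert dim % 2 == 0, "feature dimension must be even"
    # One rotation frequency per pair of feature dimensions.
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,)
    angles = np.outer(positions, inv_freq)              # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    # Rotate each (even, odd) feature pair by its position-dependent angle.
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))
# Attention scores depend only on the position offset (2 in both cases).
score_a = rope_rotate(q, np.array([3.0])) @ rope_rotate(k, np.array([1.0])).T
score_b = rope_rotate(q, np.array([10.0])) @ rope_rotate(k, np.array([8.0])).T
```

Here `score_a` and `score_b` agree up to floating-point error, because rotating the query and key makes their dot product a function of the relative offset between positions rather than the absolute positions.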
