TellWhisper: Tell Whisper Who Speaks When
Yifan Hu, Peiji Yang, Zhisheng Wang, Yicheng Zhong, Rui Liu

TL;DR
TellWhisper introduces a unified speech encoder that jointly models speaker identity and timing, improving multi-speaker speech recognition especially during rapid turn-taking and overlaps.
Contribution
The paper proposes TS-RoPE, a time-speaker rotary positional encoding, and Hyper-SD, a hyperbolic speaker activity estimator, to address the limitations of earlier methods that model time and speaker identity separately.
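To make the idea of a rotary encoding over two coordinates concrete, here is a minimal sketch in NumPy. It is not the paper's formulation: the abstract does not give the details of TS-RoPE, so the split of the feature vector into a time half and a speaker half, the scalar speaker coordinate `s`, and the function names `rotate` and `time_speaker_rotate` are all assumptions made for illustration. The sketch only shows the standard RoPE property that the query-key score depends on the relative offsets of the coordinates, not on their absolute values.

```python
import numpy as np

def rotate(x, pos, base=10000.0):
    """Standard RoPE: rotate consecutive feature pairs of x by pos * freq_i."""
    d = x.shape[-1]
    assert d % 2 == 0, "feature dimension must be even"
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per feature pair
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

def time_speaker_rotate(x, t, s):
    """Hypothetical two-axis variant (an assumption, not the paper's method):
    the first half of x is rotated by the time coordinate t,
    the second half by a scalar speaker coordinate s."""
    half = x.shape[-1] // 2
    return np.concatenate(
        [rotate(x[..., :half], t), rotate(x[..., half:], s)], axis=-1
    )

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k = rng.standard_normal(8)

# Same relative offsets (time +2, speaker +1) at two different absolute positions.
score_a = time_speaker_rotate(q, 3.0, 1.0) @ time_speaker_rotate(k, 5.0, 2.0)
score_b = time_speaker_rotate(q, 10.0, 4.0) @ time_speaker_rotate(k, 12.0, 5.0)
print(np.isclose(score_a, score_b))
```

Under these assumptions, attention scores computed from the rotated vectors are sensitive to both how far apart two frames are in time and how their speaker coordinates differ, which is the general motivation for encoding both in a single positional scheme.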
Findings
Significantly improves recognition accuracy in overlapping speech scenarios.
Effectively captures speaker turn transitions and state dynamics.
Enhances inter-class separation in speaker activity estimation.
Abstract
Multi-speaker automatic speech recognition (MASR) aims to predict "who spoke when and what" from multi-speaker speech, a key technology for multi-party dialogue understanding. However, most existing approaches decouple temporal modeling and speaker modeling when addressing "when" and "who": some inject speaker cues before encoding (e.g., speaker masking), which can cause irreversible information loss; others fuse identity by mixing speaker posteriors after encoding, which may entangle acoustic content with speaker identity. This separation is brittle under rapid turn-taking and overlapping speech, often leading to degraded performance. To address these limitations, we propose TellWhisper, a unified framework that jointly models speaker identity and temporal information within the speech encoder. Specifically, we design TS-RoPE, a time-speaker rotary positional encoding: time coordinates are…
