Speaker Targeting via Self-Speaker Adaptation for Multi-talker ASR
Weiqing Wang, Taejin Park, Ivan Medennikov, Jinhan Wang, Kunal Dhawan, He Huang, Nithin Rao Koluguri, Jagadeesh Balam, Boris Ginsburg

TL;DR
This paper introduces a self-speaker adaptation method for multi-talker streaming ASR that adapts to individual speakers without explicit enrollment, improving recognition accuracy in overlapped speech scenarios.
Contribution
The paper presents a self-speaker adaptation technique that generates speaker-specific kernels from speaker supervision activations and injects them into selected ASR encoder layers, eliminating the need for explicit speaker queries such as target-speaker embeddings or enrollment audio.
Findings
Achieves state-of-the-art performance in offline and streaming multi-talker ASR.
Effectively handles fully overlapped speech with instantaneous speaker adaptation.
Remains robust under severe speech overlap.
Abstract
We propose a self-speaker adaptation method for streaming multi-talker automatic speech recognition (ASR) that eliminates the need for explicit speaker queries. Unlike conventional approaches requiring target speaker embeddings or enrollment audio, our technique dynamically adapts individual ASR instances through speaker-wise speech activity prediction. The key innovation involves injecting speaker-specific kernels generated via speaker supervision activations into selected ASR encoder layers. This enables instantaneous speaker adaptation to target speakers while handling fully overlapped speech even in a streaming scenario. Experiments show state-of-the-art performance in both offline and streaming scenarios, demonstrating that our self-adaptive method effectively addresses severe speech overlap through streamlined speaker-focused recognition. The results validate the proposed…
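As a rough illustration of the data flow the abstract describes (per-speaker activity predictions steering separate ASR instances through kernels added at an encoder layer), here is a minimal NumPy sketch. Everything in it is an assumption made for illustration: the function names, the additive injection, and the two fixed vectors `w_target` and `w_other` stand in for whatever learned kernel generator the paper actually uses, which the abstract does not specify. It does not model how overlapped frames are resolved.

```python
import numpy as np


def speaker_kernel(activity, w_target, w_other):
    """Build a per-frame kernel from one speaker's activity (hypothetical form).

    activity: (T,) values in [0, 1], the predicted speech activity of one speaker.
    w_target, w_other: (D,) vectors; placeholders for learned parameters.
    Returns a (T, D) array that blends w_target (speaker active) and
    w_other (speaker inactive) frame by frame.
    """
    a = activity[:, None]  # (T, 1) so it broadcasts over the feature axis
    return a * w_target + (1.0 - a) * w_other


def adapt_layer(hidden, activity, w_target, w_other):
    """Inject the kernel into encoder hidden states by addition (assumed)."""
    return hidden + speaker_kernel(activity, w_target, w_other)


def adapt_per_speaker(hidden, activities, w_target, w_other):
    """Run one adapted copy of the layer input per speaker.

    hidden: (T, D) shared encoder states.
    activities: (S, T) activity predictions, one row per speaker.
    Returns (S, T, D): one speaker-conditioned view per ASR instance.
    """
    return np.stack(
        [adapt_layer(hidden, a, w_target, w_other) for a in activities]
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hidden = rng.normal(size=(4, 3))           # 4 frames, 3 features
    activities = np.array([[1.0, 1.0, 0.0, 0.0],
                           [0.0, 0.0, 1.0, 1.0]])  # two speakers
    out = adapt_per_speaker(hidden, activities, np.ones(3), -np.ones(3))
    print(out.shape)
```

The point of the sketch is only that no enrollment audio or speaker embedding enters the computation: each instance is conditioned solely on an activity track for "its" speaker.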
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems
