Lightweight Target-Speaker-Based Overlap Transcription for Practical Streaming ASR
Aleš Pražák, Marie Kunešová, Josef Psutka

TL;DR
This paper introduces a lightweight, target-speaker-based extension to streaming ASR systems that effectively transcribes overlapping speech with minimal additional computational cost, improving accuracy in real-world multi-speaker scenarios.
Contribution
The paper presents a low-overhead method that combines overlap detection with speaker-conditioned transcription to handle overlapping speech in streaming ASR.
Findings
WER on overlapping segments reduced from 68.0% to 35.78%.
Overlap detection achieved accurate segmentation with negligible cost.
Total computational load increased by only 44%.
Abstract
Overlapping speech remains a major challenge for automatic speech recognition (ASR) in real-world applications, particularly in broadcast media with dynamic, multi-speaker interactions. We propose a light-weight, target-speaker-based extension to an existing streaming ASR system to enable practical transcription of overlapping speech with minimal computational overhead. Our approach combines a speaker-independent (SI) model for standard operation with a speaker-conditioned (SC) model selectively applied in overlapping scenarios. Overlap detection is achieved using a compact binary classifier trained on frozen SI model output, offering accurate segmentation at negligible cost. The SC model employs Feature-wise Linear Modulation (FiLM) to incorporate speaker embeddings and is trained on synthetically mixed data to transcribe only the target speaker. Our method supports dynamic speaker…
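To make the FiLM conditioning mentioned in the abstract concrete, here is a minimal NumPy sketch of the general technique: a speaker embedding is projected to a per-channel scale (gamma) and shift (beta), which then modulate frame-level features. The function name, tensor shapes, and projection weights are hypothetical and chosen for illustration only; the abstract does not say where the FiLM layers sit in the SC model or how large they are.

```python
import numpy as np


def film(features, speaker_embedding, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise Linear Modulation of frame features by a speaker embedding.

    features:          (T, C) frame-level features (hypothetical shape)
    speaker_embedding: (D,)   embedding of the target speaker
    w_gamma, w_beta:   (D, C) projection weights (assumed linear projections)
    b_gamma, b_beta:   (C,)   projection biases
    """
    # Project the embedding to one scale and one shift per feature channel.
    gamma = speaker_embedding @ w_gamma + b_gamma  # shape (C,)
    beta = speaker_embedding @ w_beta + b_beta     # shape (C,)
    # Broadcast over the time axis: every frame gets the same modulation.
    return gamma * features + beta


rng = np.random.default_rng(0)
T, C, D = 5, 8, 4  # illustrative sizes, not taken from the paper
features = rng.standard_normal((T, C))
embedding = rng.standard_normal(D)

# Identity initialisation: gamma = 1 and beta = 0, so features pass through.
identity_out = film(
    features, embedding,
    np.zeros((D, C)), np.ones(C),
    np.zeros((D, C)), np.zeros(C),
)

# Random projections: the output now depends on the speaker embedding.
w_gamma = rng.standard_normal((D, C))
w_beta = rng.standard_normal((D, C))
modulated = film(features, embedding, w_gamma, np.ones(C), w_beta, np.zeros(C))
```

With the identity initialisation the layer leaves the features unchanged, which is one common way to start training a conditioned model from an unconditioned one; whether the authors do this is not stated in the abstract.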
Taxonomy
Topics: Speech Recognition and Synthesis
Methods: Sparse Evolutionary Training
