Lightweight Target-Speaker-Based Overlap Transcription for Practical Streaming ASR

Aleš Pražák; Marie Kunešová; Josef Psutka

arXiv:2506.20288 · eess.AS · June 26, 2025

TL;DR

This paper introduces a lightweight, target-speaker-based extension to streaming ASR systems that effectively transcribes overlapping speech with minimal additional computational cost, improving accuracy in real-world multi-speaker scenarios.

Contribution

It presents a novel, low-overhead method combining overlap detection and speaker-conditioned transcription to handle overlapping speech in streaming ASR.
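
The combination of overlap detection and speaker-conditioned transcription can be pictured as a per-frame routing decision. The following is a minimal sketch, not the authors' implementation: it assumes the detector emits one overlap probability per frame, and the names `overlap_segments` and `route_frames`, the 0.5 threshold, and the minimum segment length are all hypothetical choices made for illustration.

```python
from typing import List, Tuple


def overlap_segments(
    probs: List[float], threshold: float = 0.5, min_frames: int = 3
) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges whose overlap probability stays >= threshold.

    The threshold and min_frames values are illustrative assumptions,
    not settings taken from the paper.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # an overlap run begins here
        elif p < threshold and start is not None:
            if i - start >= min_frames:  # keep only sufficiently long runs
                segments.append((start, i))
            start = None
    # close a run that extends to the end of the stream
    if start is not None and len(probs) - start >= min_frames:
        segments.append((start, len(probs)))
    return segments


def route_frames(n_frames: int, segments: List[Tuple[int, int]]) -> List[str]:
    """Label each frame 'SC' inside an overlap segment and 'SI' elsewhere."""
    labels = ["SI"] * n_frames
    for start, end in segments:
        for i in range(start, end):
            labels[i] = "SC"
    return labels


probs = [0.1, 0.2, 0.9, 0.8, 0.95, 0.3, 0.7, 0.1]
segments = overlap_segments(probs)
labels = route_frames(len(probs), segments)
print(segments)
print(labels)
```

In this toy input the single-frame spike at index 6 is discarded by the minimum-length rule, so only frames 2 to 4 would be handed to the speaker-conditioned model.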

Findings

01. WER on overlapping segments fell from 68.0% to 35.78%.

02. Overlap detection, using a compact binary classifier on frozen speaker-independent (SI) model output, segments accurately at negligible cost.

03. Total computational load increased by only 44%.
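
To put the reported figures in relative terms, the short calculation below derives the absolute and relative WER reduction and the resulting compute factor. It uses only the three numbers quoted above; the variable names are illustrative, and the source does not specify exactly what is counted in the total load.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a metric dropped, relative to its starting value."""
    return (before - after) / before


wer_before = 68.0   # WER (%) on overlapping segments before, as reported
wer_after = 35.78   # WER (%) on overlapping segments after, as reported

absolute_drop = wer_before - wer_after
rel = relative_reduction(wer_before, wer_after)
load_factor = 1.0 + 0.44  # a 44% increase means 1.44x the original load

print(f"absolute WER drop: {absolute_drop:.2f} points")
print(f"relative WER reduction: {rel * 100:.1f}%")
print(f"compute relative to baseline: {load_factor:.2f}x")
```

In other words, the error rate on overlapping segments falls by a little under half, at a cost of a little under one and a half times the original compute.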

Abstract

Overlapping speech remains a major challenge for automatic speech recognition (ASR) in real-world applications, particularly in broadcast media with dynamic, multi-speaker interactions. We propose a light-weight, target-speaker-based extension to an existing streaming ASR system to enable practical transcription of overlapping speech with minimal computational overhead. Our approach combines a speaker-independent (SI) model for standard operation with a speaker-conditioned (SC) model selectively applied in overlapping scenarios. Overlap detection is achieved using a compact binary classifier trained on frozen SI model output, offering accurate segmentation at negligible cost. The SC model employs Feature-wise Linear Modulation (FiLM) to incorporate speaker embeddings and is trained on synthetically mixed data to transcribe only the target speaker. Our method supports dynamic speaker…
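
Feature-wise Linear Modulation conditions a network on side information by scaling and shifting each feature channel with values predicted from that information. The sketch below shows the general operation, gamma(e) * h + beta(e), applied to a sequence of frames with a speaker embedding e. It is a generic, pure-Python illustration: the layer sizes, the weight values, and the point in the SC model where modulation is applied are assumptions, since the source does not state them.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def linear(weights: Matrix, bias: Vector, x: Sequence[float]) -> Vector:
    """Affine map W x + b, written out with plain lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def film(
    frames: List[Vector],
    speaker_emb: Sequence[float],
    w_gamma: Matrix,
    b_gamma: Vector,
    w_beta: Matrix,
    b_beta: Vector,
) -> List[Vector]:
    """Apply FiLM: each feature h_c becomes gamma_c * h_c + beta_c.

    gamma and beta are predicted once from the speaker embedding and
    reused for every frame, so the conditioning is the same across time.
    """
    gamma = linear(w_gamma, b_gamma, speaker_emb)  # per-feature scale
    beta = linear(w_beta, b_beta, speaker_emb)     # per-feature shift
    return [[g * h + b for g, h, b in zip(gamma, frame, beta)] for frame in frames]


# Toy example: 2 frames with 2 features each, 3-dimensional speaker embedding.
# All values below are made up for illustration.
frames = [[1.0, 2.0], [3.0, 4.0]]
speaker_emb = [1.0, 0.0, -1.0]
w_gamma = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
b_gamma = [1.0, 1.0]
w_beta = [[0.0, 1.0, 0.0], [0.5, 0.0, 0.0]]
b_beta = [0.0, 0.0]

modulated = film(frames, speaker_emb, w_gamma, b_gamma, w_beta, b_beta)
print(modulated)  # [[2.0, 0.5], [6.0, 0.5]]
```

Here the embedding yields gamma = [2.0, 0.0] and beta = [0.0, 0.5], so the first feature is amplified and the second is suppressed and replaced by a constant. This is how a speaker embedding can emphasize or mute channels without changing the shape of the activations.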

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Sparse Evolutionary Training