Towards Streaming Target Speaker Extraction via Chunk-wise Interleaved Splicing of Autoregressive Language Model

Shuhai Peng; Hui Lu; Jinjiang Liu; Liyang Chen; Guiping Zhong; Jiakui Li; Huimeng Wang; Haiyun Li; Liang Cao; Shiyin Kang; Zhiyong Wu

arXiv:2604.19635 · cs.SD · April 22, 2026

TL;DR

This paper introduces a novel autoregressive model for streaming target speaker extraction that maintains high performance and stability at low latencies, enabling real-time applications.

Contribution

It proposes a Chunk-wise Interleaved Splicing Paradigm and a historical context refinement mechanism for stable, efficient streaming TSE with autoregressive models.

Findings

01

Maintains 100% stability at low latencies in streaming TSE.

02

Achieves superior intelligibility compared to baseline models.

03

Achieves a real-time factor (RTF) of 0.248 on consumer GPUs.
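For readers unfamiliar with the metric, the real-time factor is conventionally the ratio of processing time to audio duration, so a value below 1 means the system runs faster than real time. The sketch below only illustrates that standard definition. The function names and the timing values are hypothetical and are not taken from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (standard RTF definition)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def is_real_time(rtf: float) -> bool:
    """An RTF below 1.0 means audio is processed faster than it is produced."""
    return rtf < 1.0


# Hypothetical timing: 10 s of audio processed in 2.48 s of compute.
rtf = real_time_factor(processing_seconds=2.48, audio_seconds=10.0)
print(f"RTF = {rtf:.3f}, real-time capable: {is_real_time(rtf)}")
```

Under this definition, an RTF of 0.248 would correspond to roughly a quarter of a second of compute per second of audio.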

Abstract

While generative models have set new benchmarks for Target Speaker Extraction (TSE), their inherent reliance on global context precludes deployment in real-time applications. Direct adaptation to streaming scenarios often leads to catastrophic performance degradation because of the severe mismatch between training and streaming inference. To bridge this gap, we present the first autoregressive (AR) models tailored for streaming TSE. Our approach introduces a Chunk-wise Interleaved Splicing Paradigm that ensures highly efficient and stable streaming inference. To ensure coherence between the extracted speech segments, we design a historical context refinement mechanism that mitigates boundary discontinuities by leveraging historical information. Experiments on Libri2Mix show that while the AR generative baseline exhibits performance degradation at low latencies, our approach…
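The abstract does not spell out the sequence layout, so the following is only a minimal sketch of one plausible reading of chunk-wise interleaving: the mixture tokens and the target tokens are each cut into fixed-size chunks, and the chunks are spliced alternately into a single sequence so that an AR model can emit each target chunk right after seeing the matching mixture chunk. The function names, the integer token IDs, and the chunk size are all hypothetical, and the historical context refinement mechanism is not modeled here.

```python
from typing import List, Sequence


def split_into_chunks(tokens: Sequence[int], chunk_size: int) -> List[List[int]]:
    """Cut a token sequence into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [list(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


def interleave_chunks(
    mixture_tokens: Sequence[int],
    target_tokens: Sequence[int],
    chunk_size: int,
) -> List[int]:
    """Splice chunks alternately: mixture chunk 0, target chunk 0, mixture chunk 1, ...

    Assumption (not from the paper): both streams are time-aligned and yield
    the same number of chunks.
    """
    mixture_chunks = split_into_chunks(mixture_tokens, chunk_size)
    target_chunks = split_into_chunks(target_tokens, chunk_size)
    if len(mixture_chunks) != len(target_chunks):
        raise ValueError("mixture and target must produce the same number of chunks")

    spliced: List[int] = []
    for mixture_chunk, target_chunk in zip(mixture_chunks, target_chunks):
        spliced.extend(mixture_chunk)  # conditioning: newly arrived mixture audio
        spliced.extend(target_chunk)   # prediction target: extracted speech for that chunk
    return spliced


# Hypothetical token IDs, chunk size of 2.
sequence = interleave_chunks([1, 2, 3, 4], [10, 20, 30, 40], chunk_size=2)
print(sequence)
```

With a layout like this, the latency of a streaming system would be tied to the chunk size, since a target chunk can only be generated once its mixture chunk has arrived.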
