StreamFlow: Streaming Flow Matching with Block-wise Guided Attention Mask for Speech Token Decoding

Dake Guo; Jixun Yao; Linhan Ma; He Wang; Lei Xie

arXiv:2506.23986 · cs.SD · July 2, 2025

Open Access

TL;DR

StreamFlow is a streaming flow matching architecture that uses block-wise attention masks to decode speech tokens into audio in real time. Its local block-wise receptive field limits long-sequence dependencies, giving speech quality comparable to non-streaming methods at a first-packet latency of 180 ms.

Contribution

The paper proposes a novel streaming flow matching model using block-wise attention masks within diffusion transformers to improve real-time speech generation quality and efficiency.

Findings

01. Achieves speech quality comparable to non-streaming methods.

02. Outperforms existing streaming methods in speech quality.

03. Reaches a first-packet latency of only 180 ms.

Abstract

Recent advancements in discrete token-based speech generation have highlighted the importance of token-to-waveform generation for audio quality, particularly in real-time interactions. Traditional frameworks integrating semantic tokens with flow matching (FM) struggle with streaming capabilities due to their reliance on a global receptive field. Additionally, directly implementing token-by-token streaming speech generation often results in degraded audio quality. To address these challenges, we propose StreamFlow, a novel neural architecture that facilitates streaming flow matching with diffusion transformers (DiT). To mitigate the long-sequence extrapolation issues arising from lengthy historical dependencies, we design a local block-wise receptive field strategy. Specifically, the sequence is first segmented into blocks, and we introduce block-wise attention masks that enable the…
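
To make the idea of a local block-wise receptive field concrete, here is a minimal sketch of one way such an attention mask could be built. It is not the authors' implementation: the function name `blockwise_attention_mask` and the parameters `block_size`, `history_blocks`, and `lookahead_blocks` are hypothetical, and the actual mask variants used in StreamFlow may differ. The sketch assumes each position may attend to its own block, a fixed number of preceding blocks, and optionally a few following blocks.

```python
def blockwise_attention_mask(seq_len, block_size, history_blocks, lookahead_blocks=0):
    """Build a boolean mask where mask[i][j] is True if query i may attend to key j.

    Illustrative assumption: positions are grouped into consecutive blocks of
    `block_size`, and a query sees its own block, up to `history_blocks`
    earlier blocks, and up to `lookahead_blocks` later blocks.
    """
    mask = []
    for i in range(seq_len):
        query_block = i // block_size
        row = []
        for j in range(seq_len):
            # Signed distance, in blocks, from the query's block to the key's block.
            diff = j // block_size - query_block
            row.append(-history_blocks <= diff <= lookahead_blocks)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # 8 positions, blocks of 2, one block of history, no lookahead.
    for row in blockwise_attention_mask(8, 2, history_blocks=1):
        print("".join("1" if allowed else "0" for allowed in row))
```

Because the visible history is capped at `history_blocks`, the attention span stays bounded no matter how long the stream runs, which is the property a local receptive field is meant to provide. In a real model the boolean mask would typically be passed to the attention operation of each transformer layer.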

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Generative Adversarial Networks and Image Synthesis