DyStream: Streaming Dyadic Talking Heads Generation via Flow Matching-based Autoregressive Model
Bohong Chen, Haiyang Liu

TL;DR
DyStream is a real-time dyadic talking head video generation model that achieves ultra-low latency and high lip-sync quality using flow matching and causal encoding with lookahead.
Contribution
The paper introduces DyStream, a novel flow matching-based autoregressive model with a causal encoder and lookahead for real-time dyadic talking head generation.
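To make "flow matching head" concrete, the following is a minimal sketch of how sampling from a flow matching model generally works: start from Gaussian noise and Euler-integrate a velocity field from t = 0 to t = 1. It is not DyStream's implementation. The names `euler_flow_sample` and `make_toy_velocity`, the step count, and the vector size are all hypothetical, and the toy velocity field stands in for a trained network that, in an autoregressive model, would presumably also be conditioned on the backbone's state for the current frame.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_flow_sample(
    velocity: Callable[[Vector, float], Vector],
    noise: Vector,
    num_steps: int = 8,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 (noise) to t=1 with Euler steps."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    x = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the toy field never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Hypothetical stand-in for a trained head.

    Returns the exact velocity of the straight-line path
    x_t = (1 - t) * x_0 + t * target toward one fixed target vector.
    """
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]

    return velocity


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4)]
target = [0.5, -1.0, 2.0, 0.0]  # assumed 4-dimensional motion latent, for illustration only
sample = euler_flow_sample(make_toy_velocity(target), noise)
```

With this toy field every noise vector is transported to the same target, which makes the integration easy to check. A trained head would instead map different noise draws to different plausible outputs, which is what gives the model its probabilistic behavior.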
Findings
Generates video within 34 ms per frame
Maintains system latency under 100 ms
Achieves state-of-the-art lip-sync quality
Abstract
Generating realistic dyadic talking head video requires ultra-low latency. Existing chunk-based methods require full non-causal context windows, introducing significant delays. This high latency critically prevents the immediate, non-verbal feedback required for a realistic listener. To address this, we present DyStream, a flow matching-based autoregressive model that can generate video in real time from both speaker and listener audio. Our method contains two key designs: (1) we adopt a stream-friendly autoregressive framework with flow-matching heads for probabilistic modeling, and (2) we propose a causal encoder enhanced by a lookahead module to incorporate short future context (e.g., 60 ms) to improve quality while maintaining low latency. Our analysis shows this simple yet effective method significantly surpasses alternative causal strategies, including distillation and generative…
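One common way to give a causal encoder a short lookahead is an attention mask that lets each frame see all past frames plus a fixed number of future frames. The sketch below illustrates that general idea only. Whether DyStream's lookahead module works this way is not stated in the text above, and the 20 ms feature hop used to convert 60 ms into frames is an assumption, as are the function names.

```python
from typing import List


def lookahead_ms_to_frames(lookahead_ms: int, frame_ms: int) -> int:
    """Number of whole feature frames that fit inside the lookahead window."""
    if lookahead_ms < 0 or frame_ms <= 0:
        raise ValueError("lookahead_ms must be >= 0 and frame_ms must be > 0")
    return lookahead_ms // frame_ms


def lookahead_causal_mask(num_frames: int, lookahead_frames: int) -> List[List[bool]]:
    """Build a mask where mask[t][s] is True if frame t may attend to frame s.

    Frame t sees every past frame, itself, and up to `lookahead_frames`
    future frames. With lookahead_frames=0 this is a plain causal mask.
    """
    if num_frames < 0 or lookahead_frames < 0:
        raise ValueError("num_frames and lookahead_frames must be non-negative")
    return [
        [s <= t + lookahead_frames for s in range(num_frames)]
        for t in range(num_frames)
    ]


# Assumed 20 ms audio feature hop (hypothetical); 60 ms is the example from the abstract.
frames_ahead = lookahead_ms_to_frames(60, 20)
mask = lookahead_causal_mask(num_frames=6, lookahead_frames=frames_ahead)

# Print the mask as rows of 1/0 for a quick visual check.
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

The design trade-off is direct: each extra lookahead frame adds one frame hop of waiting before an output can be produced, so the lookahead has to stay small for the overall system latency to remain low.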
Taxonomy
Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Face recognition and analysis
