DyStream: Streaming Dyadic Talking Heads Generation via Flow Matching-based Autoregressive Model

Bohong Chen; Haiyang Liu

arXiv:2512.24408 · cs.CV · February 3, 2026

Open Access

TL;DR

DyStream is a real-time dyadic talking head video generation model that achieves ultra-low latency and high lip-sync quality using flow matching and causal encoding with lookahead.

Contribution

The paper introduces DyStream, a novel flow matching-based autoregressive model with a causal encoder and lookahead for real-time dyadic talking head generation.

Findings

01. Generates video within 34 ms per frame

02. Maintains system latency under 100 ms

03. Achieves state-of-the-art lip-sync quality
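
The first two findings can be related with some back-of-the-envelope arithmetic. The sketch below assumes that end-to-end latency is roughly the per-frame generation time plus the lookahead window, using the 60 ms example given in the abstract. That additive model, and the constant names, are illustrative assumptions rather than the paper's stated accounting.

```python
# Figures taken from the page: 34 ms per frame, latency budget of 100 ms.
FRAME_GEN_MS = 34.0
BUDGET_MS = 100.0

# Example lookahead from the abstract ("e.g., 60 ms").
LOOKAHEAD_MS = 60.0

# Upper bound on throughput if each frame takes 34 ms to generate.
max_fps = 1000.0 / FRAME_GEN_MS

# ASSUMPTION: latency is approximated as generation time plus lookahead wait.
estimated_latency_ms = FRAME_GEN_MS + LOOKAHEAD_MS
within_budget = estimated_latency_ms < BUDGET_MS

print(f"max throughput: {max_fps:.1f} fps")
print(f"estimated latency: {estimated_latency_ms:.0f} ms, within budget: {within_budget}")
```

Under this assumption the two figures are consistent with each other: a 34 ms frame time leaves room for a short lookahead window inside a 100 ms budget.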

Abstract

Generating realistic dyadic talking head video requires ultra-low latency. Existing chunk-based methods require full non-causal context windows, which introduces significant delays. This high latency prevents the immediate non-verbal feedback required of a realistic listener. To address this, we present DyStream, a flow matching-based autoregressive model that generates video in real time from both speaker and listener audio. Our method has two key designs: (1) we adopt a stream-friendly autoregressive framework with flow-matching heads for probabilistic modeling, and (2) we propose a causal encoder enhanced by a lookahead module that incorporates short future context (e.g., 60 ms) to improve quality while maintaining low latency. Our analysis shows that this simple yet effective method significantly surpasses alternative causal strategies, including distillation and generative…
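
One common way to give a causal encoder a short lookahead is an attention mask that lets each frame see all past frames plus a fixed number of future frames. The sketch below illustrates that idea only. It is not taken from the paper: the function names and the 20 ms feature frame duration are hypothetical, chosen so that the abstract's 60 ms example maps to a whole number of frames.

```python
import math
from typing import List


def lookahead_frames(lookahead_ms: float, frame_ms: float) -> int:
    """Number of whole future frames covered by the lookahead window."""
    return math.floor(lookahead_ms / frame_ms)


def causal_lookahead_mask(num_frames: int, lookahead: int) -> List[List[bool]]:
    """mask[t][s] is True when frame t may attend to frame s.

    Frame t sees every past frame and at most `lookahead` future frames,
    i.e. all s with s <= t + lookahead.
    """
    return [
        [s <= t + lookahead for s in range(num_frames)]
        for t in range(num_frames)
    ]


# ASSUMPTION: 20 ms per audio feature frame (hypothetical, not from the paper).
k = lookahead_frames(lookahead_ms=60.0, frame_ms=20.0)
mask = causal_lookahead_mask(num_frames=6, lookahead=k)

for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

With `lookahead=0` the mask reduces to a strictly causal one, and with `lookahead >= num_frames - 1` it becomes the full non-causal window that chunk-based methods rely on. The lookahead size therefore trades added delay against available future context.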

Taxonomy

Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Face Recognition and Analysis