SoulX-FlashHead: Oracle-guided Generation of Infinite Real-time Streaming Talking Heads

Tan Yu; Qian Qiao; Le Shen; Ke Zhou; Jincheng Hu; Dian Sheng; Bo Hu; Haoming Qin; Jun Gao; Changhai Zhou; Shunshun Yin; Siyuan Liu

arXiv:2602.07449 · cs.CV · February 12, 2026

Open Access · 1 model · 1 dataset

TL;DR

SoulX-FlashHead is a novel real-time streaming talking head generation framework that balances high visual quality and low latency, utilizing oracle-guided distillation and streaming-aware pre-training for stability and efficiency.

Contribution

The paper introduces SoulX-FlashHead, a unified large-scale model with innovative training and guidance techniques for stable, high-fidelity, real-time streaming talking head synthesis.

Findings

01. Achieves state-of-the-art results on HDTF and VFHQ benchmarks.
02. Lite variant runs at 96 FPS on NVIDIA RTX 4090.
03. Effective in maintaining identity and temporal stability in long sequences.
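To put the 96 FPS figure in context, the short sketch below converts a frame rate into a per-frame time budget and compares it with a playback rate. The 25 FPS playback rate is an assumption chosen for illustration and is not stated on this page. The function names are hypothetical.

```python
def frame_budget_ms(fps: float) -> float:
    """Return the time available to produce one frame, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def realtime_headroom(generation_fps: float, playback_fps: float) -> float:
    """Return how many times faster than playback the generator runs."""
    if playback_fps <= 0:
        raise ValueError("playback_fps must be positive")
    return generation_fps / playback_fps


REPORTED_FPS = 96.0   # Lite variant on an NVIDIA RTX 4090, as listed in the findings
PLAYBACK_FPS = 25.0   # ASSUMPTION: illustrative playback rate, not taken from the paper

if __name__ == "__main__":
    print(f"Per-frame budget at {REPORTED_FPS:.0f} FPS: "
          f"{frame_budget_ms(REPORTED_FPS):.2f} ms")
    print(f"Headroom over {PLAYBACK_FPS:.0f} FPS playback: "
          f"{realtime_headroom(REPORTED_FPS, PLAYBACK_FPS):.2f}x")
```

At 96 FPS each frame must be produced in roughly 10.4 ms. Under the assumed 25 FPS playback rate, that leaves a little under 4x headroom for audio capture, encoding, and network transfer.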

Abstract

Achieving a balance between high-fidelity visual quality and low-latency streaming remains a formidable challenge in audio-driven portrait generation. Existing large-scale models often suffer from prohibitive computational costs, while lightweight alternatives typically compromise on holistic facial representations and temporal stability. In this paper, we propose SoulX-FlashHead, a unified 1.3B-parameter framework designed for real-time, infinite-length, and high-fidelity streaming video generation. To address the instability of audio features in streaming scenarios, we introduce Streaming-Aware Spatiotemporal Pre-training equipped with a Temporal Audio Context Cache mechanism, which ensures robust feature extraction from short audio fragments. Furthermore, to mitigate the error accumulation and identity drift inherent in long-sequence autoregressive generation, we propose…
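The abstract names a Temporal Audio Context Cache but does not describe how it works. The sketch below is not the authors' method. It only illustrates the general idea of keeping recent audio as context so that a short incoming fragment is not processed in isolation. The class name, method name, and the rolling-buffer design are all assumptions.

```python
from collections import deque
from typing import Deque, List


class AudioContextCache:
    """Hypothetical rolling buffer of recent audio samples (illustration only)."""

    def __init__(self, max_context_samples: int) -> None:
        if max_context_samples < 0:
            raise ValueError("max_context_samples must be non-negative")
        # deque with maxlen silently drops the oldest samples when full.
        self._buffer: Deque[float] = deque(maxlen=max_context_samples)

    def window_for(self, chunk: List[float]) -> List[float]:
        """Return cached context followed by the new chunk, then cache the chunk."""
        window = list(self._buffer) + list(chunk)
        self._buffer.extend(chunk)
        return window


if __name__ == "__main__":
    cache = AudioContextCache(max_context_samples=4)
    for chunk in ([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0]):
        # A downstream feature extractor (not shown) would consume this window.
        print(cache.window_for(chunk))
```

In this toy version the first chunk is seen without context, while later chunks are prefixed with up to four previously seen samples. A real system would cache features or audio frames at the extractor's own granularity.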

Code & Models

Models

Soul-AILab/SoulX-FlashHead-1_3B
model · 1.2k dl · ♡ 38

Datasets

Soul-AILab/VividHead
dataset · 1.7k dl

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing · Face recognition and analysis