SoulX-FlashHead: Oracle-guided Generation of Infinite Real-time Streaming Talking Heads
Tan Yu, Qian Qiao, Le Shen, Ke Zhou, Jincheng Hu, Dian Sheng, Bo Hu, Haoming Qin, Jun Gao, Changhai Zhou, Shunshun Yin, Siyuan Liu

TL;DR
SoulX-FlashHead is a real-time streaming talking head generation framework that balances high visual quality with low latency, using oracle-guided distillation and streaming-aware pre-training for stability and efficiency.
Contribution
The paper introduces SoulX-FlashHead, a unified 1.3B-parameter model that combines Streaming-Aware Spatiotemporal Pre-training, a Temporal Audio Context Cache, and oracle-guided distillation for stable, high-fidelity, real-time streaming talking head synthesis.
Findings
Achieves state-of-the-art results on the HDTF and VFHQ benchmarks.
The Lite variant runs at 96 FPS on an NVIDIA RTX 4090.
Maintains identity and temporal stability over long sequences.
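To put the 96 FPS figure in perspective, the short calculation below converts a generation rate into a per-frame time budget and a speed-up over playback. The 25 FPS playback rate is an assumption made for illustration; this listing does not state the output video frame rate, and the function names are hypothetical.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per generated frame, in milliseconds."""
    return 1000.0 / fps


def real_time_factor(generation_fps: float, playback_fps: float) -> float:
    """How many times faster than playback the generator runs."""
    return generation_fps / playback_fps


# 96 FPS is the reported Lite-variant throughput.
# 25 FPS playback is an assumed value, not taken from the paper.
budget = frame_budget_ms(96.0)
factor = real_time_factor(96.0, 25.0)
```

Under that assumption, each frame has roughly 10.4 ms of compute budget and generation runs several times faster than playback, which leaves headroom for audio feature extraction and network transport in a streaming setting.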
Abstract
Achieving a balance between high-fidelity visual quality and low-latency streaming remains a formidable challenge in audio-driven portrait generation. Existing large-scale models often suffer from prohibitive computational costs, while lightweight alternatives typically compromise on holistic facial representations and temporal stability. In this paper, we propose SoulX-FlashHead, a unified 1.3B-parameter framework designed for real-time, infinite-length, and high-fidelity streaming video generation. To address the instability of audio features in streaming scenarios, we introduce Streaming-Aware Spatiotemporal Pre-training equipped with a Temporal Audio Context Cache mechanism, which ensures robust feature extraction from short audio fragments. Furthermore, to mitigate the error accumulation and identity drift inherent in long-sequence autoregressive generation, we propose…
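The abstract names a Temporal Audio Context Cache but does not describe its internals here. A minimal sketch of the general idea, assuming the cache simply keeps the most recent audio samples and prepends them to each incoming fragment before feature extraction, might look like the following. The class name, method name, and buffer size are hypothetical and are not taken from the paper.

```python
from collections import deque
from typing import Deque, List, Sequence


class AudioContextCache:
    """Rolling buffer of recent audio samples.

    Hypothetical illustration only; not the paper's implementation.
    """

    def __init__(self, context_samples: int) -> None:
        if context_samples < 0:
            raise ValueError("context_samples must be non-negative")
        # deque with maxlen silently drops the oldest samples on overflow.
        self._buffer: Deque[float] = deque(maxlen=context_samples)

    def extend_with_context(self, chunk: Sequence[float]) -> List[float]:
        """Return cached context followed by the new chunk, then update the cache."""
        window = list(self._buffer) + list(chunk)
        self._buffer.extend(chunk)
        return window


# Usage: each short fragment is seen together with up to 4 preceding samples.
cache = AudioContextCache(context_samples=4)
first = cache.extend_with_context([1.0, 2.0, 3.0])
second = cache.extend_with_context([4.0, 5.0, 6.0])
third = cache.extend_with_context([7.0, 8.0])
```

The design choice illustrated is that a feature extractor receives a longer, overlapping window rather than an isolated short fragment, which is one plausible way to stabilize features at chunk boundaries.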
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing · Face Recognition and Analysis
