SoulX-FlashTalk: Real-Time Infinite Streaming of Audio-Driven Avatars via Self-Correcting Bidirectional Distillation

Le Shen; Qian Qiao; Tan Yu; Ke Zhou; Tianhang Yu; Yu Zhan; Zhenjie Wang; Ming Tao; Shunshun Yin; Siyuan Liu

arXiv:2512.23379·cs.CV·January 7, 2026

TL;DR

SoulX-FlashTalk is a 14B-parameter system that enables real-time, high-fidelity, infinite streaming of audio-driven avatars by using bidirectional attention, self-correction, and optimized inference techniques.

Contribution

It introduces a novel bidirectional distillation strategy and self-correction mechanism for stable, high-quality real-time avatar generation at scale.

Findings

01. Achieves sub-second startup latency (0.87 s)
02. Reaches 32 FPS real-time throughput
03. First 14B-scale system for high-fidelity interactive avatars

Abstract

Deploying massive diffusion models for real-time, infinite-duration, audio-driven avatar generation presents a significant engineering challenge, primarily due to the conflict between computational load and strict latency constraints. Existing approaches often compromise visual fidelity by enforcing strictly unidirectional attention mechanisms or reducing model capacity. To address this problem, we introduce SoulX-FlashTalk, a 14B-parameter framework optimized for high-fidelity real-time streaming. Diverging from conventional unidirectional paradigms, we use a Self-correcting Bidirectional Distillation strategy that retains bidirectional attention within video chunks. This design preserves critical spatiotemporal correlations, significantly enhancing motion coherence and visual detail. To ensure stability during infinite generation, we incorporate a Multi-step…
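
To illustrate the attention pattern the abstract describes, the sketch below builds a frame-level mask in which every frame attends to all frames of its own chunk, both past and future. It is a minimal illustration, not the authors' implementation: the function name `chunk_attention_mask`, its parameters, and the option that lets a chunk also attend to all earlier chunks are assumptions, since the abstract does not say how chunks interact with one another.

```python
def chunk_attention_mask(num_frames, chunk_size, causal_across_chunks=True):
    """Return a num_frames x num_frames boolean mask.

    mask[q][k] is True when query frame q may attend to key frame k.
    Frames in the same chunk always see each other (bidirectional).
    ASSUMPTION: with causal_across_chunks=True, a chunk also sees every
    earlier chunk but never a later one; the source does not specify this.
    """
    if num_frames <= 0 or chunk_size <= 0:
        raise ValueError("num_frames and chunk_size must be positive")

    # Index of the chunk each frame belongs to.
    chunk_id = [frame // chunk_size for frame in range(num_frames)]

    mask = []
    for q in range(num_frames):
        row = []
        for k in range(num_frames):
            same_chunk = chunk_id[q] == chunk_id[k]
            earlier_chunk = causal_across_chunks and chunk_id[k] < chunk_id[q]
            row.append(same_chunk or earlier_chunk)
        mask.append(row)
    return mask


# Four frames split into chunks of two; 1 = may attend, 0 = masked out.
for row in chunk_attention_mask(num_frames=4, chunk_size=2):
    print("".join("1" if allowed else "0" for allowed in row))
# prints:
# 1100
# 1100
# 1111
# 1111
```

By contrast, a strictly unidirectional (frame-causal) mask would make the first row `1000`, so frame 0 could not use information from frame 1 of its own chunk. The abstract's stated motivation for keeping attention bidirectional inside a chunk is to preserve those within-chunk spatiotemporal correlations.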

Code & Models

Models

Soul-AILab/SoulX-FlashTalk-14B (Hugging Face model · 2.4k downloads · 35 likes)

Taxonomy

Topics: Music Technology and Sound Studies · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing