SoulX-LiveAct: Towards Hour-Scale Real-Time Human Animation with Neighbor Forcing and ConvKV Memory
Dingcheng Zhen, Xu Zheng, Ruixin Zhang, Zhiqi Jiang, Yichao Yan, Ming Tao, Shunshun Yin

TL;DR
This paper introduces SoulX-LiveAct, an autoregressive (AR) diffusion framework for hour-scale real-time human animation. It improves training stability, inference efficiency, and animation quality, and supports 20 FPS real-time streaming on 2 GPUs.
Contribution
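The paper proposes Neighbor Forcing, which propagates temporally adjacent frames under the same noise condition for stable, diffusion-step-consistent training, and ConvKV memory, which enables constant-memory inference. Together they target hour-scale real-time human animation.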
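To make the constant-memory point concrete, the sketch below compares how many cached frame entries accumulate over one hour at 20 FPS when history is kept in full versus when it is capped. The fixed-size window (`max_frames`) is only a stand-in chosen for illustration; it is not the ConvKV mechanism, whose details are not given in this summary, and all names here are hypothetical.

```python
from collections import deque

FPS = 20                 # streaming rate reported in the summary
SECONDS_PER_HOUR = 3600


def frames_for(hours, fps=FPS):
    """Number of frames generated in the given number of hours."""
    return int(hours * SECONDS_PER_HOUR * fps)


def peak_cache_entries(num_frames, max_frames=None):
    """Peak number of cached per-frame entries during generation.

    max_frames=None models an unbounded history; an integer models a
    bounded memory (an illustrative stand-in, not ConvKV itself).
    """
    cache = deque(maxlen=max_frames)
    peak = 0
    for frame_idx in range(num_frames):
        cache.append(frame_idx)  # oldest entry is evicted when full
        peak = max(peak, len(cache))
    return peak


one_hour = frames_for(1)                          # 72,000 frames
unbounded = peak_cache_entries(one_hour)          # grows with video length
bounded = peak_cache_entries(one_hour, 16)        # stays at the cap
frame_budget_ms = 1000 / FPS                      # 50 ms per frame at 20 FPS
```

At 20 FPS each frame has a 50 ms budget, so a history that grows with every frame becomes the limiting factor well before the one-hour mark, while a bounded memory keeps the per-frame cost flat.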
Findings
Supports 20 FPS real-time streaming on 2 GPUs.
Achieves state-of-the-art lip-sync and animation quality.
Significantly improves training convergence and inference efficiency.
Abstract
Autoregressive (AR) diffusion models offer a promising framework for sequential generation tasks such as video synthesis by combining diffusion modeling with causal inference. Although they support streaming generation, existing AR diffusion methods struggle to scale efficiently. In this paper, we identify two key challenges in hour-scale real-time human animation. First, most forcing strategies propagate sample-level representations with mismatched diffusion states, causing inconsistent learning signals and unstable convergence. Second, historical representations grow unbounded and lack structure, preventing effective reuse of cached states and severely limiting inference efficiency. To address these challenges, we propose Neighbor Forcing, a diffusion-step-consistent AR formulation that propagates temporally adjacent frames as latent neighbors under the same noise condition. This…
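One way to read "under the same noise condition" is that a frame and its temporal neighbor are noised at a single shared diffusion step, rather than at independently sampled steps. The sketch below illustrates only that reading, using the standard forward-diffusion formula on toy latents; the function names, the `alpha_bars` schedule, and the pairing logic are assumptions for illustration, not the authors' implementation.

```python
import math
import random


def add_noise(latent, alpha_bar, rng):
    """Forward-diffuse a latent: sqrt(a) * x + sqrt(1 - a) * eps."""
    signal = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    return [signal * x + sigma * rng.gauss(0.0, 1.0) for x in latent]


def neighbor_pair(prev_latent, cur_latent, alpha_bars, rng):
    """Noise two adjacent frames at ONE shared diffusion step.

    Hypothetical helper: the shared `step` is what keeps the context
    frame and the target frame in matching diffusion states.
    """
    step = rng.randrange(len(alpha_bars))
    alpha_bar = alpha_bars[step]
    noisy_prev = add_noise(prev_latent, alpha_bar, rng)
    noisy_cur = add_noise(cur_latent, alpha_bar, rng)
    return step, noisy_prev, noisy_cur


rng = random.Random(0)
alpha_bars = [0.99, 0.9, 0.5, 0.1]   # toy cumulative schedule (assumed)
prev_latent = [0.2, -0.4, 1.0]       # toy latent for frame t-1
cur_latent = [0.3, -0.3, 0.9]        # toy latent for frame t
step, noisy_prev, noisy_cur = neighbor_pair(prev_latent, cur_latent, alpha_bars, rng)
```

A mismatched-state variant would sample a separate step for each frame, which is the situation the abstract describes as causing inconsistent learning signals.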
Taxonomy
Topics: Human Motion and Animation · Generative Adversarial Networks and Image Synthesis · Face recognition and analysis
