Soul: Breathe Life into Digital Human for High-fidelity Long-term Multimodal Animation
Jiangning Zhang, Junwei Zhu, Zhenye Gan, Donghao Luo, Chuming Lin, Feifan Xu, Xu Peng, Jianlong Hu, Yuansen Liu, Yijia Hong, Weijian Cao, Han Feng, Xu Chen, Chencan Fu, Keke He, Xiaobin Hu, Chengjie Wang

TL;DR
Soul is a multimodal framework that generates high-fidelity, long-term digital human animations from minimal input, achieving realistic lip-sync, expressions, and identity preservation, with a large annotated dataset and efficient inference.
Contribution
The paper introduces Soul, a novel multimodal-driven framework with a large annotated dataset and optimized training strategies for realistic, long-term digital human animation from limited inputs.
Findings
Outperforms existing models in video quality and lip-sync accuracy.
Achieves 11.4× faster inference with negligible quality loss.
Demonstrates broad applicability in virtual and film industries.
Abstract
We propose a multimodal-driven framework for high-fidelity long-term digital human animation termed Soul, which generates semantically coherent videos from a single-frame portrait image, text prompts, and audio, achieving precise lip synchronization, vivid facial expressions, and robust identity preservation. We construct Soul-1M, containing 1 million finely annotated samples with a precise automated annotation pipeline (covering portrait, upper-body, full-body, and multi-person scenes) to mitigate data scarcity, and we carefully curate Soul-Bench for comprehensive and fair evaluation of audio-/text-guided animation methods. The model is built on the Wan2.2-5B backbone, integrating audio-injection layers and multiple training strategies together with threshold-aware codebook replacement to ensure long-term generation consistency. Meanwhile, step/CFG distillation and a…
Taxonomy
Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing
