Soul: Breathe Life into Digital Human for High-fidelity Long-term Multimodal Animation

Jiangning Zhang; Junwei Zhu; Zhenye Gan; Donghao Luo; Chuming Lin; Feifan Xu; Xu Peng; Jianlong Hu; Yuansen Liu; Yijia Hong; Weijian Cao; Han Feng; Xu Chen; Chencan Fu; Keke He; Xiaobin Hu; Chengjie Wang

arXiv:2512.13495·cs.CV·December 16, 2025



TL;DR

Soul is a multimodal framework that generates high-fidelity, long-term digital human animations from minimal input, achieving realistic lip-sync, expressions, and identity preservation, with a large annotated dataset and efficient inference.

Contribution

The paper introduces Soul, a novel multimodal-driven framework with a large annotated dataset and optimized training strategies for realistic, long-term digital human animation from limited inputs.

Findings

01. Outperforms existing models in video quality and lip-sync accuracy.

02. Achieves 11.4× faster inference with negligible quality loss.

03. Demonstrates broad applicability in virtual and film industries.

Abstract

We propose a multimodal-driven framework for high-fidelity long-term digital human animation termed Soul, which generates semantically coherent videos from a single-frame portrait image, text prompts, and audio, achieving precise lip synchronization, vivid facial expressions, and robust identity preservation. We construct Soul-1M, containing 1 million finely annotated samples with a precise automated annotation pipeline (covering portrait, upper-body, full-body, and multi-person scenes) to mitigate data scarcity, and we carefully curate Soul-Bench for comprehensive and fair evaluation of audio-/text-guided animation methods. The model is built on the Wan2.2-5B backbone, integrating audio-injection layers and multiple training strategies together with threshold-aware codebook replacement to ensure long-term generation consistency. Meanwhile, step/CFG distillation and a…
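The abstract names step/CFG distillation without describing it here. As a rough illustration only, the sketch below shows the standard classifier-free guidance (CFG) combination that such distillation typically aims to fold into a single network pass, plus a simple count of network evaluations per sample. The function names, step counts, and guidance scale are hypothetical and are not taken from the paper; the resulting ratio is not the paper's reported 11.4× figure.

```python
def cfg_combine(cond, uncond, scale):
    """Standard classifier-free guidance: move from the unconditional
    prediction toward the conditional one by a factor `scale`."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def forward_passes(num_steps, uses_cfg):
    """Network evaluations per sample. With CFG, each sampling step
    needs a conditional and an unconditional pass."""
    return num_steps * (2 if uses_cfg else 1)


# Hypothetical numbers, chosen only to make the arithmetic visible.
guided = cfg_combine(cond=[1.0, 3.0], uncond=[0.5, 1.0], scale=2.0)

teacher_cost = forward_passes(num_steps=40, uses_cfg=True)   # many steps, two passes each
student_cost = forward_passes(num_steps=4, uses_cfg=False)   # few steps, one pass each
pass_ratio = teacher_cost / student_cost

print(guided, teacher_cost, student_cost, pass_ratio)
```

Under these assumed settings, reducing the step count and removing the second pass per step both cut the number of network evaluations, which is the general reason distillation of this kind can speed up inference.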


Taxonomy

Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing