EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation
Rang Meng, Yan Wang, Weipeng Wu, Ruobing Zheng, Yuming Li, Chenguang Ma

TL;DR
EchoMimicV3 is an efficient human animation framework that unifies multiple tasks and modalities in a single 1.3B-parameter model, avoiding the cost of running a separate model for each animation task.
Contribution
The paper introduces a unified multi-task, multi-modal human animation framework built on three components: a Soup-of-Tasks paradigm, a Soup-of-Modals paradigm, and a new training and inference strategy.
Findings
Achieves competitive performance with a compact 1.3B-parameter model.
Effectively unifies multi-task and multi-modal human animation.
Demonstrates efficiency and versatility in extensive experiments.
Abstract
Recent work on human animation usually incorporates large-scale video models, thereby achieving more vivid performance. However, the practical use of such methods is hindered by the slow inference speed and high computational demands. Moreover, traditional work typically employs separate models for each animation task, increasing costs in multi-task scenarios and worsening the dilemma. To address these limitations, we introduce EchoMimicV3, an efficient framework that unifies multi-task and multi-modal human animation. At the core of EchoMimicV3 lies a threefold design: a Soup-of-Tasks paradigm, a Soup-of-Modals paradigm, and a novel training and inference strategy. The Soup-of-Tasks leverages multi-task mask inputs and a counter-intuitive task allocation strategy to achieve multi-task gains without multi-model pains. Meanwhile, the Soup-of-Modals introduces a Coupled-Decoupled…
Taxonomy
Topics: Augmented Reality Applications · Anatomy and Medical Technology · 3D Shape Modeling and Analysis
