TL;DR
MTVCraft introduces a novel framework that directly models raw 4D motion sequences for character image animation, enabling more flexible, robust, and zero-shot capable animation of diverse characters and objects.
Contribution
MTVCraft is the first framework to tokenize 4D motion for character animation, improving generalization and control over existing 2D-based methods.
Findings
Achieves state-of-the-art performance on TikTok and Fashion benchmarks.
Demonstrates robust zero-shot generalization across diverse characters and objects.
Scalable to different model sizes and applicable to various animation styles.
Abstract
Character image animation has rapidly advanced with the rise of digital humans. However, existing methods rely largely on 2D-rendered pose images for motion guidance, which limits generalization and discards essential 4D information for open-world animation. To address this, we propose MTVCraft (Motion Tokenization Video Crafter), the first framework that directly models raw 3D motion sequences (i.e., 4D motion) for character image animation. Specifically, we introduce 4DMoT (4D motion tokenizer) to quantize 3D motion sequences into 4D motion tokens. Compared to 2D-rendered pose images, 4D motion tokens offer more robust spatial-temporal cues and avoid strict pixel-level alignment between pose images and the character, enabling more flexible and disentangled control. Next, we introduce MV-DiT (Motion-aware Video DiT). By designing unique motion attention with 4D positional encodings,…
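The quantization step described in the abstract can be illustrated with a minimal sketch. It shows only a nearest-codebook lookup, the core operation of vector quantization, applied directly to toy joint coordinates. The function names `quantize` and `dequantize`, the two-entry codebook, and the toy motion data are hypothetical; the abstract does not specify the encoder, codebook size, or latent dimensions of 4DMoT, and a learned encoder would normally map the motion to latent features before the lookup.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    tokens = []
    for f in features:
        # Squared Euclidean distance from this feature to every codebook entry.
        dists = [sum((a - b) ** 2 for a, b in zip(f, c)) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def dequantize(tokens, codebook):
    """Replace each token index by its codebook vector."""
    return [codebook[t] for t in tokens]


# Hypothetical toy motion: 3 frames x 2 joints, each joint an (x, y, z) point.
motion = [
    [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
    [(0.1, 0.0, 0.0), (0.9, 1.0, 1.1)],
    [(1.0, 0.9, 1.0), (0.0, 0.1, 0.0)],
]

# Hypothetical codebook with two entries (a trained one would be learned).
codebook = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]

# One discrete token per joint per frame.
tokens = [quantize(frame, codebook) for frame in motion]
print(tokens)
```

Each joint in each frame becomes a single integer index, so the token grid keeps the frame and joint layout of the input while discarding the continuous values.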
Peer Reviews
Decision: ICLR 2026 Poster
1. Clear paradigm shift from 2D renderings to discrete 4D motion tokens, with a well-articulated rationale (coordinates vs. parameters) and an architecture that uses motion tokens natively (4D RoPE + motion attention).
2. Strong empirical gains on both TikTok and Fashion. The 18B model improves FID/FVD while modestly raising SSIM/PSNR over strong baselines.
3. Scalable & practical. The 18B integration is straightforward (zero-padding alignment), and the paper documents unsuccessful alternativ
1. Camera-view handling is implicit. There is no explicit camera-parameter conditioning; the method relies on data diversity and 4D tokens. This is workable, but it leaves open questions about view transitions and long-term 3D consistency.
2. Since SMPL joints are hard to estimate accurately, how do the authors ensure annotation quality? Also, why not use SMPL-X, which includes hands?
3. For the heavy 18B model, will the inference cost of the proposed model be 10x or even 100x of the previous U-net
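The "4D RoPE" mentioned in this review can be pictured with a small sketch of rotary position encoding applied along four axes. This is not taken from the paper: the axis assignment (t, x, y, z), the equal split of channels into four groups, the base of 10000, and the names `rope_1d`, `rope_4d`, and `dot` are all assumptions made for illustration. The sketch only demonstrates the general property that makes rotary encodings attractive for attention: the query-key dot product depends on the relative offset between positions, not on their absolute values.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive channel pairs of `vec` by angles proportional to `pos`."""
    d = len(vec)
    assert d % 2 == 0, "channel count must be even"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # one frequency per channel pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def rope_4d(vec, coord):
    """Assumed layout: four equal channel groups, one per axis of coord."""
    assert len(coord) == 4 and len(vec) % 8 == 0
    g = len(vec) // 4
    out = []
    for axis, pos in enumerate(coord):
        out.extend(rope_1d(vec[axis * g:(axis + 1) * g], pos))
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical query/key vectors and 4D positions (t, x, y, z).
q = [0.5, -1.0, 0.25, 2.0, 1.5, 0.0, -0.75, 1.0]
k = [1.0, 0.5, -0.5, 0.25, 2.0, -1.0, 0.0, 0.75]
p_q, p_k = (3, 1, 0, 2), (5, 0, 4, 1)
shift = (7, 2, 1, 3)

score = dot(rope_4d(q, p_q), rope_4d(k, p_k))
shifted = dot(
    rope_4d(q, tuple(a + b for a, b in zip(p_q, shift))),
    rope_4d(k, tuple(a + b for a, b in zip(p_k, shift))),
)
# Shifting both positions by the same offset leaves the score unchanged.
print(abs(score - shifted) < 1e-9)
```

Because each channel pair is only rotated, the encoding also preserves the norm of the input vector.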
1. The paper addresses a limitation of current methods by replacing fragile 2D pose images with robust, compact 4D motion tokens derived directly from SMPL joint coordinates.
2. Experimental results indicate the effectiveness of the proposed method.
1. The method introduces a new, independently trained component: the 4D Motion Tokenizer (4DMoT). Training this additional encoder (a VQVAE) adds complexity to the overall pipeline and represents an extra component that must be learned, stored, and maintained, potentially limiting the ability to scale the entire framework compared to methods that use off-the-shelf 2D pose estimators. During inference, the system requires an additional forward pass through the 4DMoT encoder to generate the motion
A key strength of integrating a motion-generation pipeline into video generation lies in its ability to provide explicit temporal and structural control over motion, resulting in more coherent and realistic dynamics than end-to-end pixel-based video synthesis. By introducing intermediate motion representations, such as those used in the motion-generation domain [1,2], the framework captures fine-grained spatial-temporal cues that general video models often overlook. This separation of motion from appearance al
A potential weakness of this paradigm is that the 4D motion compression via 4DMoT is conceptually straightforward and not architecturally novel. The encoder-decoder with vector quantization closely follows standard VQVAE formulations, and while it effectively transforms SMPL joint trajectories into compact motion tokens, it does not introduce fundamentally new techniques in motion encoding or representation learning. However, despite this structural simplicity, the usage and integration of such
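For reference, the "standard VQVAE formulation" this review refers to is usually trained with a three-term objective. The form below is the commonly cited one, written with a squared-error reconstruction term as an assumption; the page does not state which loss 4DMoT actually uses. Here $E$ and $D$ are the encoder and decoder, $e_j$ are the codebook vectors, $\mathrm{sg}[\cdot]$ is the stop-gradient operator, and $\beta$ is the commitment weight.

\[
z_e = E(x), \qquad z_q = e_k \quad \text{with} \quad k = \arg\min_j \lVert z_e - e_j \rVert_2
\]

\[
\mathcal{L} = \lVert x - D(z_q) \rVert_2^2 + \lVert \mathrm{sg}[z_e] - z_q \rVert_2^2 + \beta \, \lVert z_e - \mathrm{sg}[z_q] \rVert_2^2
\]

The first term trains the encoder and decoder to reconstruct the input, the second pulls the selected codebook vector toward the encoder output, and the third keeps the encoder output close to the codebook vector it was assigned.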
Taxonomy
Methods: Softmax · Attention Is All You Need
