MTVCraft: Tokenizing 4D Motion for Arbitrary Character Animation

Yanbo Ding; Xirui Hu; Zhizhi Guo; Yan Zhang; Xinrui Wang; Zhixiang He; Chi Zhang; Yali Wang; Xuelong Li

arXiv:2505.10238·cs.CV·March 10, 2026

TL;DR

MTVCraft is a character image animation framework that models raw 4D motion sequences directly, rather than 2D-rendered pose images, which makes animation of diverse characters and objects more flexible and robust, including in zero-shot settings.

Contribution

It is the first framework to tokenize 4D motion for character animation, improving generalization and control over existing 2D-based methods.

Findings

01

Achieves state-of-the-art performance on TikTok and Fashion benchmarks.

02

Demonstrates robust zero-shot generalization across diverse characters and objects.

03

Scalable to different model sizes and applicable to various animation styles.

Abstract

Character image animation has rapidly advanced with the rise of digital humans. However, existing methods rely largely on 2D-rendered pose images for motion guidance, which limits generalization and discards essential 4D information for open-world animation. To address this, we propose MTVCraft (Motion Tokenization Video Crafter), the first framework that directly models raw 3D motion sequences (i.e., 4D motion) for character image animation. Specifically, we introduce 4DMoT (4D motion tokenizer) to quantize 3D motion sequences into 4D motion tokens. Compared to 2D-rendered pose images, 4D motion tokens offer more robust spatial-temporal cues and avoid strict pixel-level alignment between pose images and the character, enabling more flexible and disentangled control. Next, we introduce MV-DiT (Motion-aware Video DiT). By designing unique motion attention with 4D positional encodings,…
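The quantization step the abstract attributes to 4DMoT can be illustrated with a minimal nearest-neighbor vector quantization sketch. This is not the authors' implementation: the function name `quantize_motion`, the toy codebook, and the choice to quantize raw joint coordinates directly (with no learned encoder or decoder) are assumptions made for illustration only.

```python
import numpy as np

def quantize_motion(motion, codebook):
    """Map each per-joint, per-frame feature to its nearest codebook entry.

    motion:   (T, J, D) array, e.g. T frames, J joints, D=3 coordinates.
    codebook: (K, D) array of K code vectors (hypothetical, untrained here).
    Returns (tokens, quantized): tokens is a (T, J) array of integer indices,
    quantized is (T, J, D) with each feature replaced by its code vector.
    """
    flat = motion.reshape(-1, motion.shape[-1])            # (T*J, D)
    # Squared Euclidean distance from every feature to every code vector.
    dists = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = dists.argmin(axis=1)                          # (T*J,)
    quantized = codebook[tokens].reshape(motion.shape)
    return tokens.reshape(motion.shape[:-1]), quantized

# Toy data (assumed shapes): 8 frames, 24 joints, xyz coordinates.
rng = np.random.default_rng(0)
motion = rng.normal(size=(8, 24, 3))
codebook = rng.normal(size=(16, 3))                        # 16 random codes

tokens, quantized = quantize_motion(motion, codebook)
print(tokens.shape, quantized.shape)                       # (8, 24) (8, 24, 3)
```

In a VQ-VAE-style tokenizer, the lookup would typically act on learned encoder features rather than raw coordinates, and the codebook would be trained jointly with the encoder and decoder. The sketch only shows how a continuous motion sequence becomes a grid of discrete token indices.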

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1. Clear paradigm shift from 2D renderings to discrete 4D motion tokens, with a well-articulated rationale (coordinates vs. parameters) and an architecture that uses motion tokens natively (4D RoPE + motion attention).
2. Strong empirical gains on both TikTok and Fashion. The 18B model improves FID/FVD while modestly raising SSIM/PSNR over strong baselines.
3. Scalable & practical. The 18B integration is straightforward (zero-padding alignment), and the paper documents unsuccessful alternativ

Weaknesses

1. Camera view handling is implicit. There is no explicit camera-parameter conditioning; the method relies on data diversity and 4D tokens. This is workable, but leaves open questions about view transitions and long-term 3D consistency.
2. Since SMPL joints are hard to estimate accurately, how do the authors ensure annotation quality? Also, why not use SMPL-X, which includes hands?
3. For the heavy 18B model, will the inference cost of the proposed model be 10x or even 100x of the previous U-net

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The paper addresses a limitation of current methods by replacing fragile 2D pose images with robust, compact 4D motion tokens derived directly from SMPL joint coordinates.
2. Experimental results indicate the effectiveness of the proposed method.

Weaknesses

1. The method introduces a new, independently trained component: the 4D Motion Tokenizer (4DMoT). Training this additional encoder (a VQVAE) adds complexity to the overall pipeline and represents an extra component that must be learned, stored, and maintained, potentially limiting the ability to scale the entire framework compared to methods that use off-the-shelf 2D pose estimators. During inference, the system requires an additional forward pass through the 4DMoT encoder to generate the motion

Reviewer 03 · Rating 8 · Confidence 3

Strengths

A key strength of integrating a motion-generation pipeline into video generation is that it provides explicit temporal and structural control over motion, yielding more coherent and realistic dynamics than end-to-end pixel-based video synthesis. By introducing intermediate motion representations, like those used in the motion generation domain [1,2], the framework captures fine-grained spatial-temporal cues that general video models often overlook. This separation of motion from appearance al

Weaknesses

A potential weakness of this paradigm is that the 4D motion compression via 4DMoT is conceptually straightforward and not architecturally novel. The encoder-decoder with vector quantization closely follows standard VQVAE formulations, and while it effectively transforms SMPL joint trajectories into compact motion tokens, it does not introduce fundamentally new techniques in motion encoding or representation learning. However, despite this structural simplicity, the usage and integration of such

Code & Models

Repositories

dingyanb/mtvcrafter
pytorch · Official

Models

🤗
yanboding/MTVCrafter
model · ♡ 30

Taxonomy

Methods: Softmax · Attention Is All You Need