CharacterShot: Controllable and Consistent 4D Character Animation

Junyao Gao; Jiaxing Li; Wenran Liu; Yanhong Zeng; Fei Shen; Kai Chen; Yanan Sun; Cairong Zhao

arXiv:2508.07409 · cs.CV · August 12, 2025

TL;DR

CharacterShot is a framework for creating controllable, consistent 4D character animations from a single reference image and a 2D pose sequence. It lifts a 2D animation model to multi-view video generation and then optimizes a 4D Gaussian splatting representation on those videos.

Contribution

The paper introduces a new 4D character animation method combining a DiT-based 2D model, dual-attention 3D lifting, and neighbor-constrained Gaussian splatting, along with a large-scale dataset and benchmark.

Findings

01. Outperforms state-of-the-art methods on CharacterBench.
02. Produces spatial-temporal and spatial-view consistent 4D animations.
03. Enables controllable animation from minimal input data.

Abstract

In this paper, we propose CharacterShot, a controllable and consistent 4D character animation framework that enables any individual designer to create dynamic 3D characters (i.e., 4D character animation) from a single reference character image and a 2D pose sequence. We begin by pretraining a powerful 2D character animation model based on a cutting-edge DiT-based image-to-video model, which allows any 2D pose sequence to serve as the control signal. We then lift the animation model from 2D to 3D by introducing a dual-attention module together with a camera prior to generate multi-view videos with spatial-temporal and spatial-view consistency. Finally, we employ a novel neighbor-constrained 4D Gaussian splatting optimization on these multi-view videos, resulting in continuous and stable 4D character representations. Moreover, to improve character-centric performance, we construct a…
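The abstract does not spell out the neighbor constraint, so the following is only a minimal sketch of one plausible reading: Gaussian centers that are close together in a reference frame should stay roughly the same distance apart in other frames. The function names, the choice of `k`, and the squared-difference penalty are all assumptions made for illustration, not the paper's formulation.

```python
import math

def knn_indices(points, k):
    """For each point, return the indices of its k nearest other points."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        result.append([j for _, j in dists[:k]])
    return result

def neighbor_loss(ref_points, cur_points, k=2):
    """Hypothetical regularizer: mean squared change in neighbor distances.

    Neighbors are chosen once, in the reference frame, and the same pairs
    are then compared in the current frame.
    """
    neighbors = knn_indices(ref_points, k)
    total, count = 0.0, 0
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            d_ref = math.dist(ref_points[i], ref_points[j])
            d_cur = math.dist(cur_points[i], cur_points[j])
            total += (d_cur - d_ref) ** 2
            count += 1
    return total / count if count else 0.0

# Toy Gaussian centers at a reference frame.
ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
# Rigid translation: neighbor distances are preserved.
shifted = [(x + 0.5, y, z) for x, y, z in ref]
# One center drifts away from its neighbors.
drifted = ref[:3] + [(0.0, 0.0, 3.0)]

print(neighbor_loss(ref, shifted))
print(neighbor_loss(ref, drifted))
```

In this toy setup a rigid translation of all centers leaves the penalty at zero, while a single center drifting away from its neighbors raises it, which is the kind of behavior a stability term for 4D Gaussian splatting would plausibly reward.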

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

* A novel framework that takes a 2D character image and a 2D pose sequence for 4D generation sounds promising, as the 2D input is more convenient than typical input while providing decent control over the generated motion.
* The result demonstrates superior multi-view consistency for the generated 4D character animation.
* A novel dataset built for the novel framework allows for further exploration of the idea.

Weaknesses

* Single-view video to 4D baselines, not just 2D to video part, could also be fine-tuned for fair comparison. Their quality degradation is more noticeable in novel views, which may be due to a lack of training data for generating multi-view character videos from single-view character videos.
* Related works discuss prior works with too much focus on the general trend of generation methods. A better summary and greater emphasis on highly relevant works scattered across different sections on chara

Reviewer 02 · Rating 6 · Confidence 3

Strengths

The visual quality of the multi-view character videos generated by CharacterShot appears clean and consistent, and very close to the ground truth. The dual-attention module, which uses parallel 3D full attention blocks to enforce visual consistency across spatial-temporal multi-view images, is an interesting and novel approach. The coarse-to-fine 3DGS, including the neighbor constraints in the fine stage, appears to be a reasonable post-processing step to improve the character video.
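To make the dual-attention description above more concrete, here is a hedged sketch (not taken from the paper or from the review) of how two parallel attention branches might partition the same set of tokens. It assumes each token is indexed by a hypothetical `(view, frame, patch)` triple, that one branch attends within a view across all frames, and that the other attends within a frame across all views. Only the grouping is shown; the attention computation itself and the way the two branch outputs are merged are left out.

```python
from itertools import product

def dual_attention_groups(num_views, num_frames, num_patches):
    """Partition (view, frame, patch) tokens for two hypothetical branches."""
    tokens = list(product(range(num_views), range(num_frames), range(num_patches)))
    # Spatial-temporal branch: one group per view, spanning all frames and patches.
    spatial_temporal = [
        [tok for tok in tokens if tok[0] == v] for v in range(num_views)
    ]
    # Spatial-view branch: one group per frame, spanning all views and patches.
    spatial_view = [
        [tok for tok in tokens if tok[1] == f] for f in range(num_frames)
    ]
    return spatial_temporal, spatial_view

st_groups, sv_groups = dual_attention_groups(num_views=4, num_frames=3, num_patches=2)
print(len(st_groups), len(st_groups[0]))  # 4 groups of 3 * 2 = 6 tokens
print(len(sv_groups), len(sv_groups[0]))  # 3 groups of 4 * 2 = 8 tokens
```

Under these assumptions every token belongs to exactly one group in each branch, so running the branches in parallel would let each token exchange information along the time axis and along the view axis in the same layer.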

Weaknesses

Since CharacterShot employs the DiT-based I2V model CogVideoX to generate the video given a character image, the paper claims this is the first DiT-based 4D character animation work. This claim does not seem well supported to me. The major experiments are only conducted on the new CharacterBench dataset introduced in this paper. The fairness of the comparison with 4 other methods on CharacterBench needs further justification, e.g., if other methods have been fine-tuned on the C

Reviewer 03 · Rating 4 · Confidence 4

Strengths

* Novel and Practical Problem Formulation: Generating 4D animation from a single image and 2D poses is a highly challenging yet valuable task. It significantly lowers the barrier to 4D content creation, as its input requirements are far less restrictive than methods requiring multi-view videos, 3D models, or even single-view videos. This has strong practical application potential.
* Solid Technical Contributions: The paper proposes a complete and technically sound pipeline to address this complex p

Weaknesses

Self-Serving Evaluation on a Niche Benchmark: A major weakness is that all quantitative comparisons rely exclusively on the authors' newly created CharacterBench, which is built from their own Character4D dataset. While dataset contribution is noted, this creates a circular evaluation loop where the method is tested on the same data distribution it was trained on (or at least a very similar one, derived from VRoidHub). This benchmark, filled with 13k anime-style characters, may not be representa

Code & Models

Models

Datasets

Gaojunyao/Character4D
dataset · 117 dl

Taxonomy

Topics: Human Motion and Animation · Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis