CharacterShot: Controllable and Consistent 4D Character Animation
Junyao Gao, Jiaxing Li, Wenran Liu, Yanhong Zeng, Fei Shen, Kai Chen, Yanan Sun, Cairong Zhao

TL;DR
CharacterShot is a framework for creating controllable, consistent 4D character animations from a single image and a 2D pose sequence. It lifts a 2D character animation model to multi-view video generation, then optimizes a 4D Gaussian splatting representation on those videos.
Contribution
The paper introduces a new 4D character animation method combining a DiT-based 2D model, dual-attention 3D lifting, and neighbor-constrained Gaussian splatting, along with a large-scale dataset and benchmark.
Findings
Outperforms state-of-the-art methods on CharacterBench.
Produces spatial-temporal and spatial-view consistent 4D animations.
Enables controllable animation from minimal input data.
Abstract
In this paper, we propose CharacterShot, a controllable and consistent 4D character animation framework that enables any individual designer to create dynamic 3D characters (i.e., 4D character animation) from a single reference character image and a 2D pose sequence. We begin by pretraining a powerful 2D character animation model based on a cutting-edge DiT-based image-to-video model, which allows any 2D pose sequence to serve as the control signal. We then lift the animation model from 2D to 3D by introducing a dual-attention module together with a camera prior to generate multi-view videos with spatial-temporal and spatial-view consistency. Finally, we employ a novel neighbor-constrained 4D Gaussian splatting optimization on these multi-view videos, resulting in continuous and stable 4D character representations. Moreover, to improve character-centric performance, we construct a…
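The abstract does not spell out the neighbor constraint, so the following is only a minimal sketch of one plausible reading: each Gaussian's per-frame displacement is pulled toward the displacements of its nearest neighbors in a canonical frame, which discourages isolated Gaussians from drifting. The function names `knn_indices` and `neighbor_loss`, the choice of `k`, and the squared-difference penalty are all assumptions for illustration, not the authors' formulation.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbors of each point (hypothetical helper).

    points: (N, 3) canonical Gaussian centers. Returns an (N, k) integer array.
    """
    diff = points[:, None, :] - points[None, :, :]   # (N, N, 3)
    d2 = np.sum(diff ** 2, axis=-1)                  # squared distances
    np.fill_diagonal(d2, np.inf)                     # exclude self-matches
    return np.argsort(d2, axis=1)[:, :k]

def neighbor_loss(canonical, displacement, k=4):
    """Assumed smoothness penalty: mean squared difference between each
    Gaussian's displacement and those of its k canonical-space neighbors."""
    idx = knn_indices(canonical, k)                  # (N, k)
    neighbor_disp = displacement[idx]                # (N, k, 3)
    diff = displacement[:, None, :] - neighbor_disp  # (N, k, 3)
    return float(np.mean(np.sum(diff ** 2, axis=-1)))
```

Under this assumed penalty, a rigid translation of all Gaussians costs nothing, while independent per-Gaussian motion is penalized. In a real optimizer such a term would be added to the rendering loss with a weight.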
Peer Reviews
Decision: Submitted to ICLR 2026
* A novel framework that takes a 2D character image and a 2D pose sequence for 4D generation sounds promising, as the 2D input is more convenient than typical input while providing decent control over the generated motion.
* The result demonstrates superior multi-view consistency for the generated 4D character animation.
* A novel dataset built for the novel framework allows for further exploration of the idea.
* Single-view video to 4D baselines, not just 2D to video part, could also be fine-tuned for fair comparison. Their quality degradation is more noticeable in novel views, which may be due to a lack of training data for generating multi-view character videos from single-view character videos.
* Related works discuss prior works with too much focus on the general trend of generation methods. A better summary and greater emphasis on highly relevant works scattered across different sections on chara
The visual quality of the multi-view character videos generated by CharacterShot appears clean, consistent, and very close to the ground truth. The dual-attention module, which uses parallel 3D full attention blocks to enforce visual consistency across spatial-temporal and multi-view images, is an interesting and novel approach. The coarse-to-fine 3DGS, including the neighbor constraints in the fine stage, appears to be a reasonable post-processing step to improve the character video.
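To make the "parallel 3D full attention blocks" idea concrete, here is a minimal sketch under stated assumptions. Tokens are laid out as (views, frames, spatial tokens, channels). One branch attends over all frames and spatial positions within a view, the other over all views and spatial positions within a frame, and the two outputs are added to the input. The single-head attention without learned projections, the residual sum, and the names `full_attention` and `dual_attention` are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(tokens):
    """Single-head self-attention with identity projections (an assumption).

    tokens: (B, L, D). Every token attends to all L tokens in its batch row.
    """
    d = tokens.shape[-1]
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(d)  # (B, L, L)
    return softmax(scores) @ tokens

def dual_attention(x):
    """x: (V, T, S, D) = views, frames, spatial tokens, channels."""
    V, T, S, D = x.shape
    # Spatial-temporal branch: attend over (T * S) tokens within each view.
    st = full_attention(x.reshape(V, T * S, D)).reshape(V, T, S, D)
    # Spatial-view branch: attend over (V * S) tokens within each frame.
    xv = x.transpose(1, 0, 2, 3)                              # (T, V, S, D)
    sv = full_attention(xv.reshape(T, V * S, D)).reshape(T, V, S, D)
    sv = sv.transpose(1, 0, 2, 3)                             # back to (V, T, S, D)
    return x + st + sv                                        # parallel branches, summed
```

In this sketch the spatial-view branch is the only path by which one view's tokens influence another view's output, which is the mechanism a cross-view consistency module relies on.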
Since CharacterShot employs the DiT-based I2V model CogVideoX to generate the video given a character image, the paper claims this is the first DiT-based 4D character animation work. This claim does not seem well supported to me. The major experiments are only conducted on the new CharacterBench dataset introduced in this paper. The fairness of the comparison with 4 other methods on CharacterBench needs further justification, e.g., if other methods have been fine-tuned on the C
Novel and Practical Problem Formulation: Generating 4D animation from a single image and 2D poses is a highly challenging yet valuable task. It significantly lowers the barrier to 4D content creation, as its input requirements are far less restrictive than methods requiring multi-view videos, 3D models, or even single-view videos. This has strong practical application potential.
Solid Technical Contributions: The paper proposes a complete and technically sound pipeline to address this complex p
Self-Serving Evaluation on a Niche Benchmark: A major weakness is that all quantitative comparisons rely exclusively on the authors' newly created CharacterBench, which is built from their own Character4D dataset. While dataset contribution is noted, this creates a circular evaluation loop where the method is tested on the same data distribution it was trained on (or at least a very similar one, derived from VRoidHub). This benchmark, filled with 13k anime-style characters, may not be representa
Taxonomy
Topics: Human Motion and Animation · Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis
