FramePrompt: In-context Controllable Animation with Zero Structural Changes
Guian Fang, Yuchao Gu, Mike Zheng Shou

TL;DR
FramePrompt is a minimalist framework that leverages pre-trained video diffusion transformers to generate controllable character animations from reference images and motion cues without structural modifications.
Contribution
FramePrompt introduces a sequence-level visual conditioning approach that simplifies controllable animation: it needs no complex architectures and no guider modules.
Findings
Outperforms representative baseline methods across multiple metrics
Simplifies the training process
Effectively uses pre-trained video diffusion transformers for animation control
Abstract
Generating controllable character animation from a reference image and motion guidance remains a challenging task due to the inherent difficulty of injecting appearance and motion cues into video diffusion models. Prior works often rely on complex architectures, explicit guider modules, or multi-stage processing pipelines, which increase structural overhead and hinder deployment. Inspired by the strong visual context modeling capacity of pre-trained video diffusion transformers, we propose FramePrompt, a minimalist yet powerful framework that treats reference images, skeleton-guided motion, and target video clips as a unified visual sequence. By reformulating animation as a conditional future prediction task, we bypass the need for guider networks and structural modifications. Experiments demonstrate that our method significantly outperforms representative baselines across various…
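The "unified visual sequence" idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `build_frame_prompt`, the tensor layout `(frames, channels, height, width)`, the ordering of the three segments, and the linear noising rule are all assumptions made for illustration. The sketch only shows how a reference frame, skeleton frames, and a target clip might be concatenated along the time axis so that the reference and skeleton stay clean as context and only the target segment is marked for prediction.

```python
import numpy as np


def build_frame_prompt(reference, skeleton, target, noise_level, rng):
    """Concatenate [reference | skeleton | noisy target] along the frame axis.

    Hypothetical helper, not from the paper. Assumed shapes:
      reference: (1, C, H, W)   appearance cue
      skeleton:  (T, C, H, W)   motion cue, one pose frame per target frame
      target:    (T, C, H, W)   clip the model should predict
    Returns the full sequence and a boolean mask that is True only for
    the frames the model is asked to denoise.
    """
    if skeleton.shape != target.shape:
        raise ValueError("skeleton and target must have the same shape")
    if reference.shape[1:] != target.shape[1:]:
        raise ValueError("reference frame size must match the target frames")
    if not 0.0 <= noise_level <= 1.0:
        raise ValueError("noise_level must lie in [0, 1]")

    # Assumed noising rule: linear interpolation between data and Gaussian noise.
    noise = rng.standard_normal(target.shape)
    noisy_target = (1.0 - noise_level) * target + noise_level * noise

    # Conditioning frames are left clean; only the target segment is noised.
    sequence = np.concatenate([reference, skeleton, noisy_target], axis=0)

    predict_mask = np.zeros(sequence.shape[0], dtype=bool)
    predict_mask[-target.shape[0]:] = True
    return sequence, predict_mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.standard_normal((1, 4, 8, 8))
    skeleton = rng.standard_normal((5, 4, 8, 8))
    target = rng.standard_normal((5, 4, 8, 8))
    sequence, predict_mask = build_frame_prompt(reference, skeleton, target, 0.7, rng)
    print(sequence.shape, int(predict_mask.sum()))
```

Under these assumptions, a training loss would be computed only on frames where `predict_mask` is True, which is one way to read "conditional future prediction": the model sees the clean context frames and predicts the frames that follow them, without any separate guider network.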
