FramePrompt: In-context Controllable Animation with Zero Structural Changes

Guian Fang; Yuchao Gu; Mike Zheng Shou

arXiv:2506.17301·cs.GR·July 3, 2025

TL;DR

FramePrompt is a minimalist framework that leverages pre-trained video diffusion transformers to generate controllable character animations from reference images and motion cues without structural modifications.

Contribution

It introduces a sequence-level visual conditioning approach that simplifies controllable animation by avoiding complex architectures and guider modules.

Findings

01 Outperforms baseline methods on multiple metrics

02 Simplifies training process

03 Effectively uses pre-trained models for animation control

Abstract

Generating controllable character animation from a reference image and motion guidance remains a challenging task due to the inherent difficulty of injecting appearance and motion cues into video diffusion models. Prior works often rely on complex architectures, explicit guider modules, or multi-stage processing pipelines, which increase structural overhead and hinder deployment. Inspired by the strong visual context modeling capacity of pre-trained video diffusion transformers, we propose FramePrompt, a minimalist yet powerful framework that treats reference images, skeleton-guided motion, and target video clips as a unified visual sequence. By reformulating animation as a conditional future prediction task, we bypass the need for guider networks and structural modifications. Experiments demonstrate that our method significantly outperforms representative baselines across various…
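The abstract's central idea, treating the reference image, the skeleton-guided motion, and the target clip as one visual sequence and predicting the last part from the first two, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `build_frame_prompt` and `noise_target_only`, the frame-axis layout, and the linear noising rule are all assumptions made for illustration, and a real system would presumably operate on encoded latents fed to a pre-trained video diffusion transformer rather than on raw pixel arrays.

```python
import numpy as np


def build_frame_prompt(reference, skeleton, target):
    """Concatenate frames along the time axis as [reference | skeleton | target].

    Hypothetical layout (an assumption, not taken from the paper):
      reference: (R, H, W, C) appearance frame(s)
      skeleton:  (T, H, W, C) rendered pose frames
      target:    (T, H, W, C) frames the model should predict
    Returns the unified sequence and a boolean mask over frames that is
    True only for the target (to-be-predicted) positions.
    """
    if reference.shape[1:] != skeleton.shape[1:]:
        raise ValueError("reference and skeleton frames must share (H, W, C)")
    if skeleton.shape != target.shape:
        raise ValueError("skeleton and target must have identical shapes")

    sequence = np.concatenate([reference, skeleton, target], axis=0)
    predict_mask = np.zeros(sequence.shape[0], dtype=bool)
    predict_mask[reference.shape[0] + skeleton.shape[0]:] = True
    return sequence, predict_mask


def noise_target_only(sequence, predict_mask, noise_level, rng):
    """Corrupt only the target frames; context frames stay clean.

    The linear blend below is an assumed noising rule chosen for
    simplicity, not the schedule used in the paper.
    """
    noisy = sequence.copy()
    clean_target = sequence[predict_mask]
    noise = rng.standard_normal(clean_target.shape)
    noisy[predict_mask] = (1.0 - noise_level) * clean_target + noise_level * noise
    return noisy
```

In this sketch the conditioning lives entirely in the sequence itself: the reference and skeleton frames are passed through untouched, so a sequence model can attend to them as context while denoising the target frames, and a training loss would be computed only where `predict_mask` is true. No extra guider network or architectural change appears anywhere in the data path, which is the property the abstract emphasizes.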
