Precise Action-to-Video Generation Through Visual Action Prompts

Yuang Wang; Chao Wen; Haoyu Guo; Sida Peng; Minghan Qin; Hujun Bao; Xiaowei Zhou; Ruizhen Hu

arXiv:2508.13104 · cs.CV · August 19, 2025

Open Access

TL;DR

This paper introduces visual action prompts, which render actions as visual skeletons, to enable precise action-to-video generation of complex interactions while keeping the learned dynamics transferable across domains.

Contribution

The paper proposes a visual skeleton-based action representation that improves both precision and cross-domain transferability in action-driven video generation models.

Findings

01. Effective cross-domain training with skeletons from HOI and robotic data

02. Improved control over complex interactions in generated videos

03. Preservation of dynamic transferability across domains

Abstract

We present visual action prompts, a unified action representation for action-to-video generation of complex high-DoF interactions while maintaining transferable visual dynamics across domains. Action-driven video generation faces a precision-generality trade-off: existing methods using text, primitive actions, or coarse masks offer generality but lack precision, while agent-centric action signals provide precision at the cost of cross-domain transferability. To balance action precision and dynamic transferability, we propose to "render" actions into precise visual prompts as domain-agnostic representations that preserve both geometric precision and cross-domain adaptability for complex actions; specifically, we choose visual skeletons for their generality and accessibility. We propose robust pipelines to construct skeletons from two interaction-rich data sources - human-object…
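To make the idea of "rendering" an action into a visual prompt concrete, the following is a minimal sketch, not the authors' pipeline. It assumes the action for one frame is already available as 2D joint positions in pixel coordinates plus a list of bones (index pairs), and rasterizes them into an image that a video model could take as conditioning. The function name `render_skeleton`, the colors, and the joint radius are hypothetical choices made for illustration.

```python
import numpy as np

def render_skeleton(joints_2d, bones, height, width, radius=2):
    """Rasterize 2D joints and bones into an RGB image (hypothetical helper).

    joints_2d: float array of shape (J, 2) holding (x, y) pixel coordinates.
    bones:     list of (i, j) index pairs into joints_2d.
    """
    canvas = np.zeros((height, width, 3), dtype=np.uint8)

    # Draw each bone as a white line by sampling points along the segment.
    for a, b in bones:
        p0, p1 = joints_2d[a], joints_2d[b]
        n = int(np.ceil(np.abs(p1 - p0).max())) + 1
        xs = np.rint(np.linspace(p0[0], p1[0], n)).astype(int)
        ys = np.rint(np.linspace(p0[1], p1[1], n)).astype(int)
        keep = (xs >= 0) & (xs < width) & (ys >= 0) & (ys < height)
        canvas[ys[keep], xs[keep]] = (255, 255, 255)

    # Draw each joint as a small red square on top of the bones.
    for x, y in np.rint(joints_2d).astype(int):
        y0, y1 = max(y - radius, 0), min(y + radius + 1, height)
        x0, x1 = max(x - radius, 0), min(x + radius + 1, width)
        if y0 < y1 and x0 < x1:  # skip joints that fall outside the image
            canvas[y0:y1, x0:x1] = (255, 0, 0)
    return canvas

# Toy three-joint chain standing in for a hand or gripper skeleton.
joints = np.array([[10.0, 10.0], [30.0, 10.0], [30.0, 30.0]])
bones = [(0, 1), (1, 2)]
frame = render_skeleton(joints, bones, height=48, width=64)
```

Calling this once per timestep and stacking the results gives a video-shaped prompt. Because the output is just pixels, the same format can in principle describe a human hand or a robot gripper, which is the kind of domain-agnostic property the abstract attributes to visual skeletons.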

Taxonomy

Topics: Human Pose and Action Recognition · Human Motion and Animation · Video Analysis and Summarization