Tri-Prompting: Video Diffusion with Unified Control over Scene, Subject, and Motion

Zhenghong Zhou; Xiaohang Zhan; Zhiqin Chen; Soo Ye Kim; Nanxuan Zhao; Haitian Zheng; Qing Liu; He Zhang; Zhe Lin; Yuqian Zhou; Jiebo Luo

arXiv:2603.15614 · cs.CV · March 17, 2026

Open Access

TL;DR

Tri-Prompting introduces a unified video diffusion framework that enables precise control over scene, subject, and motion, enhancing multi-view consistency and identity preservation for versatile content creation.

Contribution

The paper presents a unified architecture and two-stage training paradigm that integrates scene, subject, and motion control in a video diffusion model, supporting editing tasks that require all three forms of control at once.

Findings

01. Outperforms specialized baselines in multi-view subject identity preservation

02. Achieves superior 3D consistency and motion accuracy

03. Enables novel workflows like 3D-aware subject insertion

Abstract

Recent video diffusion models have made remarkable strides in visual quality, yet precise, fine-grained control remains a key bottleneck that limits practical customizability for content creation. For AI video creators, three forms of control are crucial: (i) scene composition, (ii) multi-view consistent subject customization, and (iii) camera-pose or object-motion adjustment. Existing methods typically handle these dimensions in isolation, with limited support for multi-view subject synthesis and identity preservation under arbitrary pose changes. This lack of a unified architecture makes it difficult to support versatile, jointly controllable video. We introduce Tri-Prompting, a unified framework and two-stage training paradigm that integrates scene composition, multi-view subject consistency, and motion control. Our approach leverages a dual-condition motion module driven by 3D…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face Recognition and Analysis · Human Pose and Action Recognition