Trajectory Attention for Fine-grained Video Motion Control
Zeqi Xiao, Wenqi Ouyang, Yifan Zhou, Shuai Yang, Lei Yang, Jianlou Si, Xingang Pan

TL;DR
This paper presents trajectory attention, a novel method for fine-grained camera motion control in video generation, improving precision and consistency by integrating pixel trajectory information into the attention mechanism.
Contribution
Introduces trajectory attention as an auxiliary branch to enhance camera motion control in video diffusion models, enabling better temporal correlation and content consistency.
Findings
Significant improvements in motion control precision.
Enhanced long-range temporal consistency.
Effective extension to video editing tasks.
Abstract
Recent advancements in video generation have been greatly driven by video diffusion models, with camera motion control emerging as a crucial challenge in creating view-customized visual content. This paper introduces trajectory attention, a novel approach that performs attention along available pixel trajectories for fine-grained camera motion control. Unlike existing methods that often yield imprecise outputs or neglect temporal correlations, our approach possesses a stronger inductive bias that seamlessly injects trajectory information into the video generation process. Importantly, our approach models trajectory attention as an auxiliary branch alongside traditional temporal attention. This design enables the original temporal attention and the trajectory attention to work in synergy, ensuring both precise motion control and new content generation capability, which is critical when…
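The two-branch idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes single-head attention without learned projections, integer pixel trajectories given as (row, col) positions per frame, averaging where several trajectories land on the same pixel, and a plain residual sum to merge the two branches; the function names (self_attention, temporal_attention, trajectory_attention, combined_block) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(seq):
    """Single-head self-attention over the second-to-last axis.

    seq: (..., T, C). Assumption: no learned Q/K/V projections, so
    queries, keys, and values are all the input itself.
    """
    scale = 1.0 / np.sqrt(seq.shape[-1])
    scores = (seq @ np.swapaxes(seq, -1, -2)) * scale  # (..., T, T)
    return softmax(scores, axis=-1) @ seq


def temporal_attention(feats):
    """Attend across frames at each fixed pixel location.

    feats: (T, H, W, C) feature volume.
    """
    T, H, W, C = feats.shape
    seq = feats.transpose(1, 2, 0, 3).reshape(H * W, T, C)
    out = self_attention(seq)
    return out.reshape(H, W, T, C).transpose(2, 0, 1, 3)


def trajectory_attention(feats, traj):
    """Attend across frames along given pixel trajectories.

    traj: (N, T, 2) integer array; traj[n, t] is the (row, col) of
    trajectory n in frame t. Pixels on no trajectory receive zeros.
    """
    T = feats.shape[0]
    rows, cols = traj[..., 0], traj[..., 1]                    # (N, T) each
    frames = np.broadcast_to(np.arange(T)[None, :], rows.shape)
    seq = feats[frames, rows, cols]                            # (N, T, C)
    attended = self_attention(seq)

    # Scatter results back; average if trajectories overlap (assumption).
    out = np.zeros_like(feats)
    counts = np.zeros(feats.shape[:3] + (1,))
    np.add.at(out, (frames, rows, cols), attended)
    np.add.at(counts, (frames, rows, cols), 1.0)
    return out / np.maximum(counts, 1.0)


def combined_block(feats, traj):
    # Assumption: the auxiliary branch is merged by a residual sum.
    return feats + temporal_attention(feats) + trajectory_attention(feats, traj)
```

For example, with feats of shape (4, 3, 3, 8) and a traj array of shape (3, 4, 2), combined_block(feats, traj) returns an array of the same shape as feats. The temporal branch can only relate features at the same spatial location across frames, whereas the trajectory branch relates features that follow the same moving pixel, which is the inductive bias the abstract refers to.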
Peer Reviews
Decision: ICLR 2025 Poster
1. The proposed method is lightweight, requiring low training costs, making it practical and efficient for real-world applications without the need for extensive computational resources. 2. The method demonstrates strong transferability, showing effectiveness with different architectures such as DiT. 3. The paper conducts thorough exploration at application level, showcasing the method's effectiveness in multiple tasks, including camera motion control and video editing. Ablation studies are su
1. The method heavily relies on dense optical flow information, as shown in Figure 3 of the supplementary material. This dependency can significantly increase inference time due to the computational cost of processing dense optical flow, especially in real-time applications. 2. The reliance on dense optical flow makes it challenging to adapt the method to user inputs of sparse trajectories. As noted in DragNUWA, it's difficult for users to input precise trajectories at key points in practical a
1. The paper introduces a novel concept of Trajectory Attention for fine-grained motion control in video generation. This auxiliary attention mechanism enhances the existing temporal attention in video diffusion models by explicitly incorporating trajectory information, which is a significant advancement in the field. 2. By modeling trajectory attention as an auxiliary branch that works alongside the original temporal attention, the approach allows for seamless integration without modifying the
1. The method is primarily designed for video diffusion models that use decomposed spatial-temporal attention. It is less clear how well the approach generalizes to models with integrated spatial-temporal attention (e.g. 3D DiTs) or other architectures. Expanding the evaluation to include such models would strengthen the contribution. 2. The paper compares the proposed method with a limited set of existing approaches. Including discussions with more recent or state-of-the-art methods, especially
- Metric-wise, it seems the model achieves better camera control. - The model can be used for first-edited-frame + original-video-guided editing, though how this is achieved is not very clear.
1) Figure-1 is confusing. It takes some time to understand the input and output of each task. It would be better to reorganize this figure to make it clearer. Each task could be separated into a small sub-figure with a clear indication of the input and output. 2) In Figure-3, it’s unclear what the model’s input is in two scenarios: (1) when you have multiple frames as input, i.e., ‘camera motion control on videos’ in Figure-1, and (2) when you have multiple frames plus edited frames as input, i
Taxonomy
Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Medical Image Segmentation Techniques
Methods: Softmax · Attention Is All You Need · Diffusion
