SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation
Koichi Namekata, Sherwin Bahmani, Ziyi Wu, Yash Kant, Igor Gilitschenski, David B. Lindell

TL;DR
SG-I2V introduces a zero-shot, self-guided framework for controllable image-to-video generation that leverages pre-trained models without fine-tuning, enabling precise control over video elements with high visual quality.
Contribution
This work presents a novel zero-shot control method for image-to-video generation that eliminates the need for fine-tuning or annotated datasets, improving efficiency and accessibility.
Findings
Outperforms unsupervised baselines in controllability and quality
Narrows the gap with supervised models in visual fidelity and motion accuracy
Operates without additional training or external annotations
Abstract
Methods for image-to-video generation have achieved impressive, photo-realistic quality. However, adjusting specific elements in generated videos, such as object motion or camera movement, is often a tedious process of trial and error, e.g., involving re-generating videos with different random seeds. Recent techniques address this issue by fine-tuning a pre-trained model to follow conditioning signals, such as bounding boxes or point trajectories. Yet, this fine-tuning procedure can be computationally expensive, and it requires datasets with annotated object motion, which can be difficult to procure. In this work, we introduce SG-I2V, a framework for controllable image-to-video generation that is self-guided, offering zero-shot control by relying solely on the knowledge present in a pre-trained image-to-video diffusion model without the need for fine-tuning or external…
Peer Reviews
Decision: ICLR 2025 Poster
S1: The paper analyzes the intrinsic relationship between the motion in the generated video and the UNet feature maps. S2: Building on this analysis, a self-supervised framework is designed to achieve zero-shot controllable image-to-video (I2V) generation.
W1: The performance values reported in Table 1 on the VIPSeg dataset differ from those in the DragAnything paper, which reports ObjMC, FVD, and FID values of 305.7, 494.8, and 33.5, respectively. Additionally, Table 1 shows DragNUWA outperforming DragAnything, which contradicts the findings in the DragAnything paper. Could the authors clarify the specific versions of DragAnything and DragNUWA used in their experiments, any differences in their evaluation setup compared to the original papers, an
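For readers unfamiliar with the ObjMC numbers discussed in W1, the following is a minimal sketch of a trajectory-error metric. It assumes ObjMC is the mean Euclidean distance between generated and target trajectory points, as the metric is usually described; the function name `mean_trajectory_distance` is hypothetical, and how the generated trajectory is extracted from a video (for example with a point tracker) is outside the sketch. It is not the evaluation code behind Table 1.

```python
import math

def mean_trajectory_distance(pred, target):
    """Mean Euclidean distance between paired (x, y) trajectory points.

    Assumption: this mirrors the usual description of ObjMC; the exact
    protocol used in the paper's experiments is not specified here.
    """
    if len(pred) != len(target):
        raise ValueError("trajectories must have the same number of points")
    if not pred:
        raise ValueError("trajectories must not be empty")
    # Per-point distance, then an average over all points.
    distances = [
        math.hypot(px - tx, py - ty)
        for (px, py), (tx, ty) in zip(pred, target)
    ]
    return sum(distances) / len(distances)
```

Lower values mean the generated motion follows the target trajectory more closely, which is why differences in evaluation setup can shift the reported numbers.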
This paper exhibits several strengths: 1. The motivation behind the research is clear and well-founded. 2. The use of latent optimization for trajectory control appears both intuitive and innovative. 3. The experimental results validate the effectiveness of the proposed method on the SVD model. The findings are convincing, and the ablation study is thorough.
This paper exhibits several weaknesses: 1. There is a lack of generalization verification across various types of video diffusion models, such as VideoCrafter, EmuVideo (based on the U-Net structure), and CogVideoX-I2V (based on Diffusion Transformer). 2. The issue of inconsistent features across frames, as discussed in the introduction, may not be universally applicable. This problem might not be evident in some of the latest 3D full attention-based video diffusion models (e.g., OpenSoraPlan 1.2,
1. The proposed approach employs feature map correspondences to regulate video latent updating and achieves object trajectory control in a zero-shot manner. 2. Both the quantitative and qualitative experimental results demonstrate the effectiveness of the proposed approach. 3. The paper is well-written and the technical figures are very clear. 4. The investigation and ablation studies (including the visualization of feature maps) on the feature map selection are comprehensive for proposa
1. The major concern for this paper is its technical contribution. Even though the weakness of unsatisfactory feature alignment in the up-sampling blocks (the generalization problem in the video domain) has been revealed, the latent optimization is still borrowed from drag-based control methods, e.g., DragDiffusion. The feature map selection and validation are valuable in my personal opinion. Nevertheless, there is no new concept in the whole architecture design. Therefore, it can be t
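The reviews above describe the method as latent optimization guided by feature map correspondences along a target trajectory. The toy sketch below illustrates only that general idea, under loud assumptions: the "features" are the latents themselves (an identity stand-in for U-Net feature maps), the trajectory is a list of integer box centres, and the names `box_crop`, `alignment_loss`, and `optimize_latents` are hypothetical. It is not the authors' implementation and does not involve a diffusion model.

```python
import numpy as np

def box_crop(feat, center, half):
    """View of the (2*half+1) x (2*half+1) window of `feat` centred on (row, col)."""
    r, c = center
    return feat[r - half:r + half + 1, c - half:c + half + 1]

def alignment_loss(latents, trajectory, half):
    """Sum of squared differences between each frame's box and frame 0's box."""
    ref = box_crop(latents[0], trajectory[0], half)
    return sum(
        float(np.sum((box_crop(latents[t], trajectory[t], half) - ref) ** 2))
        for t in range(1, len(latents))
    )

def optimize_latents(latents, trajectory, half=1, lr=0.1, steps=50):
    """Gradient descent on the latents so each frame's box matches frame 0's box.

    Assumption: features == latents, so the gradient of the squared error
    with respect to a frame's box is simply 2 * (crop - ref).
    """
    z = latents.copy()
    for _ in range(steps):
        # Frame 0 is the fixed target; it receives no gradient.
        ref = box_crop(z[0], trajectory[0], half).copy()
        for t in range(1, len(z)):
            crop = box_crop(z[t], trajectory[t], half)  # view into z
            crop -= lr * 2.0 * (crop - ref)             # in-place update of z
    return z

# Usage: 4 frames of 8x8 "latents", with a 3x3 box moving diagonally.
rng = np.random.default_rng(0)
latents = rng.normal(size=(4, 8, 8))
trajectory = [(2, 2), (2, 3), (3, 4), (4, 5)]
updated = optimize_latents(latents, trajectory)
print(alignment_loss(latents, trajectory, 1), alignment_loss(updated, trajectory, 1))
```

Only the entries inside each frame's box are modified, so content outside the controlled region is left untouched. In a real pipeline the features would come from a chosen layer of the pre-trained network, which is the selection the reviewers single out as the valuable part of the work.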
Taxonomy
Topics: Cell Image Analysis Techniques · Medical Image Segmentation Techniques · Advanced Vision and Imaging
Methods: Diffusion
