Follow-Your-Click: Open-domain Regional Image Animation via Short Prompts
Yue Ma, Yingqing He, Hongfa Wang, Andong Wang, Chenyang Qi, Chengfei Cai, Xiu Li, Zhifeng Li, Heung-Yeung Shum, Wei Liu, and Qifeng Chen

TL;DR
Follow-Your-Click introduces a user-friendly image-to-video generation framework that allows precise local object control using simple clicks and short prompts, improving quality and controllability over existing methods.
Contribution
The paper presents the first-frame masking strategy, a motion-augmented module with a short prompt dataset, and flow-based motion control, enhancing local control and generation quality in image animation.
Findings
Outperforms 7 baselines on 8 metrics
Achieves better control and quality than previous methods
Enables simple user interaction for precise local animation
Abstract
Despite recent advances in image-to-video generation, better controllability and local animation are less explored. Most existing image-to-video methods are not locally aware and tend to move the entire scene. However, human artists may need to control the movement of different objects or regions. Additionally, current I2V methods require users not only to describe the target motion but also to provide redundant detailed descriptions of frame contents. These two issues hinder the practical utilization of current I2V tools. In this paper, we propose a practical framework, named Follow-Your-Click, to achieve image animation with a simple user click (for specifying what to move) and a short motion prompt (for specifying how to move). Technically, we propose the first-frame masking strategy, which significantly improves the video generation quality, and a motion-augmented module equipped…
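The interaction described above (a click for what to move, a short prompt for how to move) can be illustrated with a small sketch. The abstract does not say how a click is turned into a region, so the flood fill below is a stand-in assumption and not the authors' method; the names click_to_mask, tol, and the request dictionary are hypothetical and chosen only for illustration.

```python
from collections import deque


def click_to_mask(image, click, tol=10):
    """Grow a binary region mask from a clicked pixel.

    Hypothetical helper: a plain 4-connected flood fill over a grayscale
    image (list of rows), used here as a stand-in for whatever region
    selection the real system performs.
    """
    height, width = len(image), len(image[0])
    row, col = click
    seed = image[row][col]
    mask = [[0] * width for _ in range(height)]
    mask[row][col] = 1
    queue = deque([(row, col)])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < height and 0 <= nc < width
            # Accept unvisited neighbours whose intensity is close to the seed.
            if inside and not mask[nr][nc] and abs(image[nr][nc] - seed) <= tol:
                mask[nr][nc] = 1
                queue.append((nr, nc))
    return mask


# Toy 4x4 grayscale image with one bright object near the top-left.
image = [
    [0, 0, 0, 0],
    [0, 200, 205, 0],
    [0, 198, 0, 0],
    [0, 0, 0, 0],
]

# The two user inputs named in the abstract: a click and a short motion prompt.
request = {
    "mask": click_to_mask(image, click=(1, 1)),  # what to move
    "prompt": "wave",                            # how to move
}
```

In this sketch the mask marks only the pixels connected to the click, which mirrors the idea of animating a selected region while the rest of the scene stays still.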
Taxonomy
Topics: Video Analysis and Summarization
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Attentive Walk-Aggregating Graph Neural Network
