Instruction-based Image Manipulation by Watching How Things Move
Mingdeng Cao, Xuaner Zhang, Yinqiang Zheng, Zhihao Xia

TL;DR
This paper presents a new dataset and model for instruction-based image manipulation. The dataset is built from pairs of video frames, with editing instructions generated by multimodal large language models, and it supports complex, realistic edits for which training data is hard to generate synthetically.
Contribution
The paper introduces a novel dataset construction pipeline that samples frame pairs from videos and uses multimodal large language models to write the editing instructions. The resulting dataset is used to train InstructMove, a model for instruction-based complex image manipulation.
Findings
InstructMove achieves state-of-the-art results in pose adjustment.
The dataset enables complex manipulations like element rearrangement.
Video-based data captures natural dynamics for realistic editing.
Abstract
This paper introduces a novel dataset construction pipeline that samples pairs of frames from videos and uses multimodal large language models (MLLMs) to generate editing instructions for training instruction-based image manipulation models. Video frames inherently preserve the identity of subjects and scenes, ensuring consistent content preservation during editing. Additionally, video data captures diverse, natural dynamics, such as non-rigid subject motion and complex camera movements, that are difficult to model otherwise, making it an ideal source for scalable dataset construction. Using this approach, we create a new dataset to train InstructMove, a model capable of instruction-based complex manipulations that are difficult to achieve with synthetically generated datasets. Our model demonstrates state-of-the-art performance in tasks such as adjusting subject poses, rearranging…
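The frame-pair sampling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `FramePair`, `sample_frame_pairs`, and `build_instruction_prompt`, the time-gap bounds, and the prompt wording are all hypothetical, chosen only to show the idea of picking a source and a target frame from the same video and asking an MLLM to describe the change as an editing instruction.

```python
import random
from dataclasses import dataclass


@dataclass
class FramePair:
    """Indices of a (source, target) frame pair from one video."""
    source_idx: int
    target_idx: int


def sample_frame_pairs(num_frames, fps, min_gap_s=1.0, max_gap_s=3.0,
                       num_pairs=4, seed=0):
    """Sample frame pairs whose temporal gap lies in [min_gap_s, max_gap_s].

    The gap bounds are illustrative assumptions, not values from the paper.
    A gap that is too small gives near-identical frames; one that is too
    large risks a scene change that no single edit instruction describes.
    """
    rng = random.Random(seed)
    min_gap = max(1, round(min_gap_s * fps))
    max_gap = min(round(max_gap_s * fps), num_frames - 1)
    if max_gap < min_gap:
        return []  # video too short for the requested gap
    pairs = []
    for _ in range(num_pairs):
        gap = rng.randint(min_gap, max_gap)
        src = rng.randint(0, num_frames - 1 - gap)
        pairs.append(FramePair(src, src + gap))
    return pairs


def build_instruction_prompt(pair):
    """Build a hypothetical text prompt to send to an MLLM with both frames."""
    return (
        f"Image A is frame {pair.source_idx} and image B is frame "
        f"{pair.target_idx} of the same video. Write one short editing "
        "instruction that would turn image A into image B."
    )


if __name__ == "__main__":
    for p in sample_frame_pairs(num_frames=300, fps=30.0):
        print(p, "|", build_instruction_prompt(p))
```

Each sampled pair, together with the instruction the MLLM returns, would form one (source image, instruction, target image) training triplet. The actual MLLM call is left out because the summary does not say which model or API was used.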
Taxonomy
Topics: Image Processing Techniques and Applications
