Instruction-based Image Manipulation by Watching How Things Move

Mingdeng Cao; Xuaner Zhang; Yinqiang Zheng; Zhihao Xia

arXiv:2412.12087·cs.CV·December 17, 2024

Open Access

TL;DR

This paper presents a new dataset and model for instruction-based image manipulation. The dataset is built from video frame pairs with editing instructions written by multimodal large language models, which enables complex, realistic edits that are hard to achieve with synthetically generated datasets.

Contribution

The paper introduces a dataset construction pipeline that samples pairs of frames from videos and uses multimodal large language models to generate the editing instructions. The resulting dataset is used to train InstructMove, a model for instruction-based image manipulation.

Findings

01

InstructMove achieves state-of-the-art results in pose adjustment.

02

The dataset enables complex manipulations like element rearrangement.

03

Video-based data captures natural dynamics for realistic editing.

Abstract

This paper introduces a novel dataset construction pipeline that samples pairs of frames from videos and uses multimodal large language models (MLLMs) to generate editing instructions for training instruction-based image manipulation models. Video frames inherently preserve the identity of subjects and scenes, ensuring consistent content preservation during editing. Additionally, video data captures diverse, natural dynamics (such as non-rigid subject motion and complex camera movements) that are difficult to model otherwise, making it an ideal source for scalable dataset construction. Using this approach, we create a new dataset to train InstructMove, a model capable of instruction-based complex manipulations that are difficult to achieve with synthetically generated datasets. Our model demonstrates state-of-the-art performance in tasks such as adjusting subject poses, rearranging…
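The pipeline described in the abstract (sample two frames from a video, then ask an MLLM to describe the change as an editing instruction) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `EditSample` record, the `min_gap`/`max_gap` frame-distance parameters, and the `describe_change` callable standing in for the MLLM are all hypothetical names introduced here, and the page does not state how frames are actually spaced or filtered.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class EditSample:
    """One training triplet: source frame, target frame, editing instruction."""
    video_id: str
    source_idx: int
    target_idx: int
    instruction: str


def sample_frame_pairs(
    num_frames: int,
    num_pairs: int,
    min_gap: int,
    max_gap: int,
    rng: random.Random,
) -> List[Tuple[int, int]]:
    """Pick (source, target) frame indices that are min_gap..max_gap frames apart.

    The gap bounds are an assumption of this sketch: far enough apart that
    something moved, close enough that subject and scene stay the same.
    """
    if num_frames <= min_gap or max_gap < min_gap:
        return []
    pairs = []
    for _ in range(num_pairs):
        gap = rng.randint(min_gap, min(max_gap, num_frames - 1))
        src = rng.randint(0, num_frames - 1 - gap)
        pairs.append((src, src + gap))
    return pairs


def build_dataset(
    videos: Dict[str, int],
    describe_change: Callable[[str, int, int], str],
    pairs_per_video: int = 2,
    min_gap: int = 10,
    max_gap: int = 60,
    seed: int = 0,
) -> List[EditSample]:
    """Turn videos (id -> frame count) into instruction-annotated frame pairs.

    `describe_change` is a placeholder for the MLLM call that looks at both
    frames and writes the instruction; it is hypothetical, not a real API.
    """
    rng = random.Random(seed)
    samples = []
    for video_id, num_frames in videos.items():
        for src, tgt in sample_frame_pairs(
            num_frames, pairs_per_video, min_gap, max_gap, rng
        ):
            instruction = describe_change(video_id, src, tgt)
            samples.append(EditSample(video_id, src, tgt, instruction))
    return samples
```

In use, `describe_change` would wrap whatever multimodal model is available; for a dry run it can be a stub such as `lambda vid, s, t: f"edit {vid}: frame {s} to {t}"`. Keeping the instruction generator as an injected callable means the sampling logic can be checked without any model in the loop.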

Taxonomy

Topics: Image Processing Techniques and Applications