InstructAny2Pix: Flexible Visual Editing via Multimodal Instruction Following
Shufan Li, Harkanwar Singh, Aditya Grover

TL;DR
InstructAny2Pix is a versatile multi-modal system that enables fine-grained visual editing using instructions involving text, images, and audio, advancing controllability in image generation and editing.
Contribution
We introduce a multi-modal instruction-following framework that integrates audio, images, and text for flexible visual editing, with a novel unified encoding and decoding architecture.
Findings
Successfully performs complex multi-modal editing tasks
Improves visual quality with a refinement prior module
Demonstrates versatility across different instruction types
Abstract
The ability to provide fine-grained control for generating and editing visual imagery has profound implications for computer vision and its applications. Previous works have explored extending controllability in two directions: instruction tuning with text-based prompts and multi-modal conditioning. However, these works make one or more unnatural assumptions on the number and/or type of modality inputs used to express controllability. We propose InstructAny2Pix, a flexible multi-modal instruction-following system that enables users to edit an input image using instructions involving audio, images, and text. InstructAny2Pix consists of three building blocks that facilitate this capability: a multi-modal encoder that encodes different modalities such as images and audio into a unified latent space, a diffusion model that learns to decode representations in this latent space into images,…
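The abstract's first two building blocks (a multi-modal encoder that maps different modalities into one latent space, and a decoder that turns a latent into an image) can be illustrated with a toy data-flow sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the names `ToyMultiModalEncoder` and `toy_decode`, the latent size, and the fixed random linear projections are hypothetical, and the decoder is a simple deterministic stand-in rather than a diffusion model. The third building block is cut off in the abstract above and is not modeled here.

```python
import math
import random
from typing import Dict, List, Sequence

LATENT_DIM = 8  # hypothetical size of the shared latent space


def make_projection(in_dim: int, out_dim: int, seed: int) -> List[List[float]]:
    """Build a fixed random (out_dim x in_dim) matrix as a stand-in for a learned encoder."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]


class ToyMultiModalEncoder:
    """Maps per-modality feature vectors of different sizes into one latent space."""

    def __init__(self, input_dims: Dict[str, int], latent_dim: int = LATENT_DIM):
        self.latent_dim = latent_dim
        # One projection per modality; all of them share the same output size.
        self.projections = {
            name: make_projection(dim, latent_dim, seed=index)
            for index, (name, dim) in enumerate(sorted(input_dims.items()))
        }

    def encode(self, modality: str, features: Sequence[float]) -> List[float]:
        matrix = self.projections[modality]
        if len(features) != len(matrix[0]):
            raise ValueError(f"expected {len(matrix[0])} features for {modality!r}")
        return [sum(w * x for w, x in zip(row, features)) for row in matrix]


def toy_decode(latent: Sequence[float], size: int = 4) -> List[List[float]]:
    """Placeholder decoder (NOT diffusion): squash latent values into a size x size grid in [0, 1]."""
    return [
        [1.0 / (1.0 + math.exp(-latent[(r * size + c) % len(latent)])) for c in range(size)]
        for r in range(size)
    ]


# Usage: image and audio features of different sizes end up in the same latent space.
encoder = ToyMultiModalEncoder({"image": 16, "audio": 12})
image_latent = encoder.encode("image", [0.1] * 16)
audio_latent = encoder.encode("audio", [0.2] * 12)
decoded = toy_decode(image_latent)
```

The point of the sketch is only the interface: because every modality is projected to the same latent size, a single decoder can consume a latent regardless of which modality produced it.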
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Advanced Vision and Imaging
Methods: Diffusion
