MIRA: Multimodal Iterative Reasoning Agent for Image Editing
Ziyun Zeng, Hang Hua, Jiebo Luo

TL;DR
MIRA is a multimodal reasoning agent that iteratively interprets and executes complex image editing instructions, improving accuracy and quality through visual feedback and a specialized dataset.
Contribution
Introduces MIRA, a novel iterative reasoning framework for image editing that leverages multimodal feedback and a new dataset for enhanced performance.
Findings
MIRA outperforms existing models in semantic consistency.
MIRA achieves results comparable to or better than those of proprietary systems.
The approach effectively handles complex, compositional editing instructions.
Abstract
Instruction-guided image editing offers an intuitive way for users to edit images with natural language. However, diffusion-based editing models often struggle to accurately interpret complex user instructions, especially those involving compositional relationships, contextual cues, or referring expressions, leading to edits that drift semantically or fail to reflect the intended changes. We tackle this problem by proposing MIRA (Multimodal Iterative Reasoning Agent), a lightweight, plug-and-play multimodal reasoning agent that performs editing through an iterative perception-reasoning-action loop, effectively simulating multi-turn human-model interaction processes. Instead of issuing a single prompt or static plan, MIRA predicts atomic edit instructions step by step, using visual feedback to make its decisions. Our 150K multimodal tool-use dataset, MIRA-Editing, combined with a…
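The perception-reasoning-action loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `iterative_edit`, `toy_propose`, and `toy_apply` are hypothetical, the "image" is a toy set of visible tags rather than pixels, and the rule-based proposer stands in for the multimodal reasoning agent. It only shows the control flow, in which each atomic instruction is chosen after inspecting the current edited state instead of being fixed in a static plan up front.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional, Tuple

# Toy stand-in for an image: the set of things currently "visible" in it.
Image = FrozenSet[str]


@dataclass
class Step:
    instruction: str  # atomic edit instruction issued at this step
    image: Image      # image state after the edit was applied


def iterative_edit(
    image: Image,
    goal: str,
    propose: Callable[[Image, str, List[Step]], Optional[str]],
    apply_edit: Callable[[Image, str], Image],
    max_steps: int = 8,
) -> Tuple[Image, List[Step]]:
    """Run a perceive -> reason -> act loop until the proposer stops."""
    history: List[Step] = []
    for _ in range(max_steps):
        # Perception + reasoning: look at the current image and the goal,
        # then pick the next atomic edit (or None when nothing remains).
        atomic = propose(image, goal, history)
        if atomic is None:
            break
        # Action: hand the atomic instruction to the editing backend.
        image = apply_edit(image, atomic)
        history.append(Step(atomic, image))
    return image, history


def toy_propose(image: Image, goal: str, history: List[Step]) -> Optional[str]:
    """Hypothetical proposer: return the first 'add X' clause not yet visible."""
    for part in goal.split(";"):
        atomic = part.strip()
        tag = atomic[len("add "):]
        if tag not in image:
            return atomic
    return None


def toy_apply(image: Image, atomic: str) -> Image:
    """Hypothetical editor: 'add X' makes X visible in the toy image."""
    return image | {atomic[len("add "):]}


final, history = iterative_edit(
    frozenset({"dog"}), "add red hat; add sunglasses", toy_propose, toy_apply
)
print([step.instruction for step in history])
print(sorted(final))
```

Because the proposer re-reads the image state on every iteration, a clause that is already satisfied (or that a real editor failed to apply) changes what is issued next, which is the role the abstract assigns to visual feedback.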
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Digital Humanities and Scholarship
