MSRAMIE: Multimodal Structured Reasoning Agent for Multi-instruction Image Editing
Zhaoyuan Qiu, Ken Chen, Xiangwei Wang, Yu Xia, Sachith Seneviratne, and Saman Halgamuge

TL;DR
MSRAMIE is a training-free multimodal reasoning framework that enhances multi-instruction image editing by decomposing complex tasks into iterative steps, improving instruction following and output quality without additional training.
Contribution
It introduces a novel reasoning topology with Tree-of-States and Graph-of-References, enabling systematic, interpretable multi-step image editing without retraining.
Findings
Improves instruction-following accuracy by over 15% on complex instructions.
Increases the probability of completing all modifications in a single run by over 100%, that is, it more than doubles it.
Maintains perceptual quality and visual consistency in edited images.
Abstract
Existing instruction-based image editing models perform well with simple, single-step instructions but degrade in realistic scenarios that involve multiple, lengthy, and interdependent directives. A main cause is the scarcity of training data with complex multi-instruction annotations. However, it is costly to collect such data and retrain these models. To address this challenge, we propose MSRAMIE, a training-free agent framework built on a Multimodal Large Language Model (MLLM). MSRAMIE takes existing editing models as plug-in components and handles multi-instruction tasks via structured multimodal reasoning. It orchestrates iterative interactions between an MLLM-based Instructor and an image editing Actor, introducing a novel reasoning topology that comprises the proposed Tree-of-States and Graph-of-References. During inference, complex instructions are decomposed into multiple editing…
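Illustrative sketch
The Instructor/Actor interaction described in the abstract can be pictured with a small Python sketch. This is not the authors' implementation: the names StateNode, run_editing_loop, actor, and accept are hypothetical, the instruction decomposition is assumed to have already produced a list of steps, and the retry policy is an assumption made for illustration. The Graph-of-References is left out because this summary does not describe it.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class StateNode:
    """Hypothetical node in a tree of states: an image and the step that produced it."""
    image: str                                  # stand-in for real image data
    step: Optional[str] = None                  # editing step applied to reach this node
    parent: Optional["StateNode"] = None
    children: List["StateNode"] = field(default_factory=list)

    def add_child(self, image: str, step: str) -> "StateNode":
        child = StateNode(image=image, step=step, parent=self)
        self.children.append(child)
        return child


def run_editing_loop(
    root_image: str,
    steps: List[str],
    actor: Callable[[str, str], str],           # assumed: (image, step) -> edited image
    accept: Callable[[str, str], bool],         # assumed: Instructor's check of one edit
    max_retries: int = 2,                       # assumed retry budget per step
) -> Tuple[StateNode, StateNode]:
    """Apply steps one at a time, recording every attempt as a child state."""
    root = StateNode(image=root_image)
    current = root
    for step in steps:
        for _ in range(max_retries + 1):
            # Every attempt is kept in the tree, whether or not it is accepted.
            candidate = current.add_child(actor(current.image, step), step)
            if accept(candidate.image, step):
                current = candidate             # continue from the accepted state
                break
        # If no attempt was accepted, the loop stays on the last accepted state.
    return root, current
```

In a real system the actor would wrap an existing image editing model and accept would be an MLLM judgment. With toy string-based stand-ins, the function returns the root of the tree and the last accepted state, so rejected attempts stay inspectable as sibling nodes.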
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Cell Image Analysis Techniques
