MSRAMIE: Multimodal Structured Reasoning Agent for Multi-instruction Image Editing

Zhaoyuan Qiu, Ken Chen, Xiangwei Wang, Yu Xia, Sachith Seneviratne, and Saman Halgamuge

arXiv:2603.16967·cs.CV·March 19, 2026

TL;DR

MSRAMIE is a training-free multimodal reasoning framework that decomposes complex multi-instruction image editing tasks into iterative steps, improving instruction following while preserving output quality.

Contribution

It introduces a novel reasoning topology with Tree-of-States and Graph-of-References, enabling systematic, interpretable multi-step image editing without retraining.

Findings

01

Improves instruction-following accuracy by over 15% on complex instructions.

02

Increases the rate of completing all modifications in a single run by over 100%.

03

Maintains perceptual quality and visual consistency in edited images.

Abstract

Existing instruction-based image editing models perform well with simple, single-step instructions but degrade in realistic scenarios that involve multiple, lengthy, and interdependent directives. A main cause is the scarcity of training data with complex multi-instruction annotations, and collecting such data and retraining these models is costly. To address this challenge, we propose MSRAMIE, a training-free agent framework built on a Multimodal Large Language Model (MLLM). MSRAMIE takes existing editing models as plug-in components and handles multi-instruction tasks via structured multimodal reasoning. It orchestrates iterative interactions between an MLLM-based Instructor and an image editing Actor, introducing a novel reasoning topology that comprises the proposed Tree-of-States and Graph-of-References. During inference, complex instructions are decomposed into multiple editing…
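The Instructor/Actor loop described in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: `State`, `decompose`, `toy_actor`, `run_editing`, and `path_to` are hypothetical names, the Instructor is reduced to splitting on `;` rather than calling an MLLM, the Actor is a stub that records the step as text, and the state tree is grown as a single chain with no branching. The Graph-of-References is left out because the abstract shown here does not describe it.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class State:
    """One node in a hypothetical tree of intermediate editing states."""
    image: str                      # stand-in for real image data
    step: Optional[str] = None      # sub-instruction that produced this state
    parent: Optional["State"] = None
    children: List["State"] = field(default_factory=list)


def decompose(instruction: str) -> List[str]:
    """Toy Instructor: split a compound instruction into single steps.

    Assumption: a real Instructor would be an MLLM; splitting on ';'
    is only a placeholder for that reasoning.
    """
    return [part.strip() for part in instruction.split(";") if part.strip()]


def toy_actor(image: str, step: str) -> str:
    """Toy Actor: 'edits' the image by recording the applied step."""
    return f"{image} + [{step}]"


def run_editing(image: str, instruction: str,
                actor: Callable[[str, str], str] = toy_actor) -> State:
    """Apply each decomposed step in turn, adding one child state per edit."""
    root = State(image=image)
    current = root
    for step in decompose(instruction):
        child = State(image=actor(current.image, step), step=step, parent=current)
        current.children.append(child)
        current = child
    return current


def path_to(state: State) -> List[str]:
    """Follow parent links to the root and return the applied steps in order."""
    steps = []
    while state.parent is not None:
        steps.append(state.step)
        state = state.parent
    return steps[::-1]


final = run_editing("photo", "remove the car; make the sky orange; add a dog")
print(final.image)
print(path_to(final))
```

Keeping parent links on every state means any intermediate result can be traced back to the sequence of edits that produced it, which is one plausible reason to store states in a tree rather than overwrite a single image.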

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Cell Image Analysis Techniques