MCIE: Multimodal LLM-Driven Complex Instruction Image Editing with Spatial Guidance

Xuehai Bai; Xiaoling Gu; Akide Liu; Hangjie Yuan; YiFan Zhang; Jack Ma

arXiv:2602.07993 · cs.CV · February 10, 2026

TL;DR

This paper introduces MCIE-E1, a multimodal large language model (MLLM)-driven method for complex instruction image editing. It improves instruction compliance and background consistency through a spatial-aware cross-attention module and a background-consistent cross-attention module, and is supported by a new dataset and benchmark.

Contribution

The paper presents a novel architecture with spatial-aware and background-consistent modules, a dedicated data pipeline, and a new benchmark for complex instruction image editing.

Findings

01 Achieves a 23.96% improvement in instruction compliance.

02 Outperforms previous methods in quantitative and qualitative evaluations.

03 Introduces CIE-Bench, a benchmark with new evaluation metrics.

Abstract

Recent advances in instruction-based image editing have shown remarkable progress. However, existing methods remain limited to relatively simple editing operations, hindering real-world applications that require complex and compositional instructions. In this work, we address these limitations from the perspectives of architectural design, data, and evaluation protocols. Specifically, we identify two key challenges in current models: insufficient instruction compliance and background inconsistency. To this end, we propose MCIE-E1, a Multimodal Large Language Model-Driven Complex Instruction Image Editing method that integrates two key modules: a spatial-aware cross-attention module and a background-consistent cross-attention module. The former enhances instruction-following capability by explicitly aligning semantic instructions with spatial regions through spatial guidance during the…
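
The abstract is cut off before it details the spatial-aware cross-attention module, so the following is only a rough illustration of the general idea of aligning instruction tokens with spatial regions, not the authors' implementation. It assumes a plain masked cross-attention in which each image token may attend only to the instruction tokens whose region covers it. The function name `spatially_guided_cross_attention` and the `region_mask` input are hypothetical and do not come from the paper.

```python
import math


def softmax(row):
    # Numerically stable softmax over one row of logits.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def spatially_guided_cross_attention(queries, keys, values, region_mask):
    """Masked cross-attention sketch (illustrative assumption, not MCIE-E1 itself).

    queries:      N image-token vectors of size d
    keys, values: T instruction-token vectors (keys of size d)
    region_mask:  N x T booleans; region_mask[i][j] is True when image
                  token i lies inside the region tied to instruction token j
    Returns (outputs, weights).
    """
    d = len(queries[0])
    outputs, all_weights = [], []
    for q, allowed in zip(queries, region_mask):
        # Scaled dot-product logits between one image token and every instruction token.
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        # Apply the spatial mask; rows with no allowed token are left unmasked.
        if any(allowed):
            logits = [l if ok else float("-inf") for l, ok in zip(logits, allowed)]
        weights = softmax(logits)
        # Weighted sum of instruction values for this image token.
        out = [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Two image tokens, two instruction tokens, each token tied to one region.
outputs, weights = spatially_guided_cross_attention(
    queries=[[1.0, 0.0], [0.0, 1.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1.0, 0.0], [0.0, 1.0]],
    region_mask=[[True, False], [False, True]],
)
```

With this mask, each image token receives information only from the instruction token assigned to its region, which is one simple way a spatial prior could keep an edit from leaking into unrelated parts of the image.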

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques