MCIE: Multimodal LLM-Driven Complex Instruction Image Editing with Spatial Guidance
Xuehai Bai, Xiaoling Gu, Akide Liu, Hangjie Yuan, YiFan Zhang, Jack Ma

TL;DR
This paper introduces MCIE-E1, a multimodal large language model-driven approach for complex instruction image editing that improves instruction compliance and background consistency using spatial and background modules, supported by a new dataset and benchmark.
Contribution
The paper presents a novel architecture with spatial-aware and background-consistent modules, a dedicated data pipeline, and a new benchmark for complex instruction image editing.
Findings
Achieves a 23.96% improvement in instruction compliance.
Outperforms previous methods in quantitative and qualitative evaluations.
Introduces CIE-Bench with new evaluation metrics.
Abstract
Recent advances in instruction-based image editing have shown remarkable progress. However, existing methods remain limited to relatively simple editing operations, hindering real-world applications that require complex and compositional instructions. In this work, we address these limitations from the perspectives of architectural design, data, and evaluation protocols. Specifically, we identify two key challenges in current models: insufficient instruction compliance and background inconsistency. To this end, we propose MCIE-E1, a Multimodal Large Language Model-Driven Complex Instruction Image Editing method that integrates two key modules: a spatial-aware cross-attention module and a background-consistent cross-attention module. The former enhances instruction-following capability by explicitly aligning semantic instructions with spatial regions through spatial guidance during the…
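The idea of aligning instruction semantics with spatial regions can be illustrated with a generic region-masked cross-attention, sketched below. This is not the MCIE-E1 implementation: the abstract is truncated here and does not specify the mechanism, so the function name `masked_cross_attention`, the boolean `region_mask`, and the convention that image patches act as queries while instruction tokens act as keys and values are all assumptions made for illustration.

```python
import math

def masked_cross_attention(queries, keys, values, region_mask):
    """Generic region-masked cross-attention (illustrative assumption only).

    queries:     one feature vector per image patch
    keys/values: one feature vector per instruction token
    region_mask: region_mask[i][j] is True if patch i may attend to token j
    """
    d = len(keys[0])
    out_dim = len(values[0])
    outputs = []
    for q, allowed in zip(queries, region_mask):
        # Scaled dot-product scores; disallowed tokens get -inf.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) if ok else -math.inf
            for k, ok in zip(keys, allowed)
        ]
        top = max(scores)
        if top == -math.inf:
            # Patch lies outside every guided region: no instruction signal.
            outputs.append([0.0] * out_dim)
            continue
        # Numerically stable softmax over the allowed tokens.
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values)) / total
            for c in range(out_dim)
        ])
    return outputs


if __name__ == "__main__":
    patches = [[1.0, 0.0], [0.0, 1.0]]   # two image patches
    tokens = [[1.0, 0.0], [0.0, 1.0]]    # two instruction tokens
    mask = [[True, False], [False, True]]  # each patch tied to one token
    print(masked_cross_attention(patches, tokens, tokens, mask))
```

In this sketch a patch that falls outside every guided region receives a zero vector, so the instruction leaves it untouched. That behaviour is a design choice of the illustration, loosely in the spirit of keeping the background consistent, and is not a detail taken from the paper.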
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
