Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing

Chun-Hsiao Yeh; Yilin Wang; Nanxuan Zhao; Richard Zhang; Yuheng Li; Yi Ma; Krishna Kumar Singh

arXiv:2507.05259 · cs.CV · July 8, 2025

TL;DR

X-Planner is a novel system that uses chain-of-thought reasoning with multimodal large language models to decompose complex image editing instructions into manageable steps, enabling precise, identity-preserving edits without manual masks.

Contribution

We introduce X-Planner, a planning framework that automates instruction decomposition and mask generation, improving the accuracy of complex image edits while reducing manual effort.

Findings

01 Achieves state-of-the-art results on existing benchmarks.

02 Effectively handles complex, indirect instructions.

03 Reduces manual intervention in image editing workflows.

Abstract

Recent diffusion-based image editing methods have significantly advanced text-guided tasks but often struggle to interpret complex, indirect instructions. Moreover, current models frequently suffer from poor identity preservation or unintended edits, or rely heavily on manual masks. To address these challenges, we introduce X-Planner, a Multimodal Large Language Model (MLLM)-based planning system that effectively bridges user intent with editing model capabilities. X-Planner employs chain-of-thought reasoning to systematically decompose complex instructions into simpler, clear sub-instructions. For each sub-instruction, X-Planner automatically generates precise edit types and segmentation masks, eliminating manual intervention and ensuring localized, identity-preserving edits. Additionally, we propose a novel automated pipeline for generating large-scale data to train X-Planner which…
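To make the decomposition concrete, here is a minimal Python sketch of what a plan might look like as data: one complex instruction split into ordered sub-edits, each carrying an edit type and a target region. Every name here (`SubEdit`, `EditPlan`, `apply_plan`, the edit-type labels, and the bounding box standing in for a segmentation mask) is an assumption made for illustration and is not taken from the paper. The plan is written by hand; in X-Planner it would come from the MLLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical types for illustration only; not the paper's interface.
Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class SubEdit:
    instruction: str        # simple, direct sub-instruction
    edit_type: str          # illustrative label, e.g. "color_change"
    region: Optional[Box]   # stand-in for a segmentation mask; None = whole image


@dataclass
class EditPlan:
    complex_instruction: str
    steps: List[SubEdit]


def apply_plan(image, plan: EditPlan, editor: Callable):
    """Apply each sub-edit in order, feeding each result into the next step."""
    for step in plan.steps:
        image = editor(image, step)
    return image


# Hand-written plan standing in for the planner's output.
plan = EditPlan(
    complex_instruction="Make the scene feel like winter",
    steps=[
        SubEdit("Cover the ground with snow", "texture_change", (0, 300, 512, 512)),
        SubEdit("Turn the tree leaves white", "color_change", (100, 50, 400, 300)),
    ],
)


def logging_editor(image, step: SubEdit):
    # Dummy editor: records the applied sub-edits instead of touching pixels.
    return image + [(step.edit_type, step.instruction)]


result = apply_plan([], plan, logging_editor)
```

Passing the editor in as a callable keeps the plan independent of any particular editing model, which reflects the abstract's description of the planner as a bridge between user intent and editing model capabilities.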
