I2E: From Image Pixels to Actionable Interactive Environments for Text-Guided Image Editing

Jinghan Yu; Junhao Xiao; Chenyu Zhu; Jiaming Li; Jia Li; HanMing Deng; Xirui Wang; Guoli Jia; Jianjun Li; Xiang Bai; Bowen Zhou; Zhiyuan Ma

arXiv:2601.03741·cs.CV·April 8, 2026

TL;DR

I2E introduces a structured, object-based approach to text-guided image editing, overcoming limitations of pixel-centric methods by decomposing images into objects and reasoning through atomic actions.

Contribution

The paper presents I2E, a novel framework that decomposes images into objects and uses a physics-aware agent for precise, multi-object editing based on complex instructions.

Findings

01 I2E outperforms existing methods on complex compositional editing tasks.

02 I2E maintains physical plausibility and multi-turn editing stability.

03 I2E-Bench provides a new benchmark for spatial reasoning and high-precision editing.

Abstract

Existing text-guided image editing methods primarily rely on an end-to-end, pixel-level inpainting paradigm. Despite its success in simple scenarios, this paradigm struggles with compositional editing tasks that require precise local control and complex multi-object spatial reasoning. It is severely limited by 1) the implicit coupling of planning and execution, 2) the lack of object-level control granularity, and 3) the reliance on unstructured, pixel-centric modeling. To address these limitations, we propose I2E, a novel "Decompose-then-Action" paradigm that revisits image editing as an actionable interaction process within a structured environment. I2E utilizes a Decomposer to transform unstructured images into discrete, manipulable object layers and then introduces a physics-aware Vision-Language-Action Agent to parse complex instructions into a series of…
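To make the "Decompose-then-Action" idea concrete, here is a minimal Python sketch of what editing over object layers, rather than pixels, might look like. It is not the authors' implementation: the names `ObjectLayer`, `Move`, `Remove`, `apply_action`, `run_plan`, and `render_order` are hypothetical, the layers are plain bounding-box records standing in for whatever the Decomposer produces, and the two actions are illustrative stand-ins for the atomic actions the agent emits. Physics checks and rendering are omitted.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Union


@dataclass(frozen=True)
class ObjectLayer:
    """One manipulable object layer (hypothetical structure, not from the paper)."""
    name: str
    x: int  # left edge of the layer, in pixels
    y: int  # top edge of the layer, in pixels
    z: int  # stacking order; higher values are drawn on top


@dataclass(frozen=True)
class Move:
    """Illustrative atomic action: translate one layer by (dx, dy)."""
    target: str
    dx: int
    dy: int


@dataclass(frozen=True)
class Remove:
    """Illustrative atomic action: delete one layer from the scene."""
    target: str


Action = Union[Move, Remove]
Scene = Dict[str, ObjectLayer]


def apply_action(scene: Scene, action: Action) -> Scene:
    """Return a new scene with one action applied; the input scene is not mutated."""
    if action.target not in scene:
        raise KeyError(f"unknown object layer: {action.target}")
    updated = dict(scene)
    if isinstance(action, Move):
        layer = scene[action.target]
        updated[action.target] = replace(
            layer, x=layer.x + action.dx, y=layer.y + action.dy
        )
    elif isinstance(action, Remove):
        del updated[action.target]
    else:
        raise TypeError(f"unsupported action: {action!r}")
    return updated


def run_plan(scene: Scene, plan: List[Action]) -> Scene:
    """Execute a planned action sequence step by step (planning is assumed done)."""
    for action in plan:
        scene = apply_action(scene, action)
    return scene


def render_order(scene: Scene) -> List[str]:
    """Layer names from back to front, as a compositor would draw them."""
    return [layer.name for layer in sorted(scene.values(), key=lambda l: l.z)]


# A toy decomposed scene and a plan for an instruction such as
# "slide the cup to the right and take the book away".
scene = {
    "table": ObjectLayer("table", x=0, y=100, z=0),
    "cup": ObjectLayer("cup", x=40, y=80, z=1),
    "book": ObjectLayer("book", x=90, y=85, z=2),
}
plan = [Move("cup", dx=30, dy=0), Remove("book")]
edited = run_plan(scene, plan)
```

Keeping the plan as a list of explicit actions separates planning from execution, which is the decoupling the abstract argues pixel-level inpainting lacks. Each step touches exactly one named layer, so untouched objects stay unchanged by construction.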
