NEP: Autoregressive Image Editing via Next Editing Token Prediction
Huimin Wu, Xiaojian Ma, Haozhe Zhao, Yanpeng Zhao, Qing Li

TL;DR
This paper introduces NEP, a novel autoregressive image editing method that selectively regenerates only the edited regions, reducing unnecessary computation and improving edit quality, with a pre-trained model capable of zero-shot editing and iterative refinement.
Contribution
The paper proposes Next Editing-token Prediction (NEP) for targeted image editing, and pre-trains an any-order autoregressive text-to-image (T2I) model that supports zero-shot, any-region editing, achieving state-of-the-art results.
Findings
NEP outperforms existing methods on image editing benchmarks.
The pre-trained model enables zero-shot editing and iterative refinement through test-time scaling.
Selective editing reduces unnecessary computation and preserves non-edited regions.
Abstract
Text-guided image editing involves modifying a source image based on a language instruction and, typically, requires changes to only small local regions. However, existing approaches generate the entire target image rather than selectively regenerate only the intended editing areas. This results in (1) unnecessary computational costs and (2) a bias toward reconstructing non-editing regions, which compromises the quality of the intended edits. To resolve these limitations, we propose to formulate image editing as Next Editing-token Prediction (NEP) based on autoregressive image generation, where only regions that need to be edited are regenerated, thus avoiding unintended modification to the non-editing areas. To enable any-region editing, we propose to pre-train an any-order autoregressive text-to-image (T2I) model. Once trained, it is capable of zero-shot image editing and can be…
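The core idea in the abstract, regenerating only the tokens in the edited region while leaving every other token untouched, can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the flat list of integer tokens, and the `predict_token` callback are all hypothetical, and conditioning on the text instruction is omitted. The sketch only shows that the number of prediction steps equals the number of edited positions, and that those positions can be visited in any order.

```python
from typing import Callable, List, Sequence


def regenerate_edit_tokens(
    source_tokens: Sequence[int],
    edit_positions: Sequence[int],
    predict_token: Callable[[List[int], int], int],
) -> List[int]:
    """Copy the source tokens and re-predict only the edited positions.

    edit_positions may be given in any order; they are visited in that order.
    predict_token is a hypothetical stand-in for an autoregressive model.
    """
    target = list(source_tokens)  # non-edited tokens are carried over as-is
    for pos in edit_positions:
        if not 0 <= pos < len(target):
            raise IndexError(f"edit position {pos} is out of range")
        # The predictor sees the current token sequence and the position to fill.
        target[pos] = predict_token(target, pos)
    return target


def toy_predictor(context: List[int], position: int) -> int:
    """Hypothetical predictor: returns a recognizable dummy token."""
    return 1000 + position


source = [10, 11, 12, 13, 14, 15]
edited = regenerate_edit_tokens(source, [3, 2], toy_predictor)
print(edited)  # [10, 11, 1002, 1003, 14, 15]
```

Only two prediction steps run here, one per edited position, whereas regenerating the full sequence would take six.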
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications · Multimodal Machine Learning Applications
