Visual Autoregressive Modeling for Instruction-Guided Image Editing
Qingyang Mao, Qi Cai, Yehao Li, Yingwei Pan, Mingyue Cheng, Ting Yao, Qi Liu, Tao Mei

TL;DR
VAREdit introduces a visual autoregressive framework for instruction-guided image editing. It improves instruction adherence and efficiency over diffusion models by predicting multi-scale target features conditioned on the source image and the text instruction.
Contribution
The paper proposes VAREdit, a novel autoregressive approach with a Scale-Aligned Reference (SAR) module, enabling precise, fast image editing guided by text instructions.
Findings
Outperforms diffusion-based methods on EMU-Edit and PIE-Bench benchmarks.
Edits images 2.2x faster than UltraEdit.
Adheres to editing instructions more closely than the compared diffusion-based methods.
Abstract
Recent advances in diffusion models have brought remarkable visual fidelity to instruction-guided image editing. However, their global denoising process inherently entangles the edited region with the entire image context, leading to unintended spurious modifications and compromised adherence to editing instructions. In contrast, autoregressive models offer a distinct paradigm by formulating image synthesis as a sequential process over discrete visual tokens. Their causal and compositional mechanism naturally circumvents the adherence challenges of diffusion-based methods. In this paper, we present VAREdit, a visual autoregressive (VAR) framework that reframes image editing as a next-scale prediction problem. Conditioned on source image features and text instructions, VAREdit generates multi-scale target features to achieve precise edits. A core challenge in this paradigm is how to…
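The next-scale prediction idea in the abstract can be illustrated with a toy loop. This is a minimal sketch, not the authors' implementation: the scale schedule `SCALES`, the codebook size `VOCAB_SIZE`, and the stub `predict_scale` are all hypothetical names and values chosen for illustration, and the stub samples tokens at random instead of running a model.

```python
import random

# Hypothetical coarse-to-fine schedule: side length of the token map per scale.
SCALES = (1, 2, 4, 8)
VOCAB_SIZE = 16  # toy codebook size, chosen only for illustration


def predict_scale(side, previous_maps, source_tokens, instruction, rng):
    """Stand-in for the model: returns one side x side map of token ids.

    A real model would attend over previous_maps, source_tokens and the
    instruction; this stub ignores them and samples at random.
    """
    return [[rng.randrange(VOCAB_SIZE) for _ in range(side)] for _ in range(side)]


def generate(source_tokens, instruction, scales=SCALES, seed=0):
    """Generate target token maps scale by scale, coarsest first."""
    rng = random.Random(seed)
    maps = []
    for side in scales:
        # All tokens of one scale are produced in a single step,
        # conditioned on every coarser scale generated so far.
        maps.append(predict_scale(side, maps, source_tokens, instruction, rng))
    return maps


source = [[0] * 8 for _ in range(8)]  # placeholder source-image tokens
target_maps = generate(source, "make the sky red")
steps = len(target_maps)
tokens = sum(len(m) * len(m[0]) for m in target_maps)
print(steps, tokens)  # 4 85
```

With this toy schedule the loop takes 4 prediction steps to emit 85 tokens, whereas a token-by-token autoregressive model would need one step per token.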
Peer Reviews
Decision · ICLR 2026 Poster
1. This paper explores how to apply VAR architecture to image editing, moving away from dominant diffusion approaches. The SOTA results on balanced metrics and 2.2x speedup are significant, suggesting AR models are a highly promising direction for image editing.
2. The core contribution is the deep analysis of the scale mismatch problem in VAR conditioning. The attention map analysis (Figure 3) is strong evidence. The resulting SAR module is an elegant and efficient solution that precisely targe

1. The paper's primary metrics (GPT-Suc., GPT-Over., GPT-Bal.) rely on GPT-4o as a judge. While arguably better than CLIP, this is costly, slow, and dependent on a proprietary API, making the evaluation results difficult and expensive to reproduce. The authors should provide an alternative using an open-source VLM (such as Qwen3-VL) in the rebuttal.
2. The authors used CLIP and GPT scores as evaluation metrics, but lacked test results on benchmarks commonly used in the editing field, such as Img

1. The paper analyzes and highlights the limitations of both full-scale and finest-scale-only image conditioning for autoregressive editing tasks, a nuanced challenge not prominently addressed in prior work.
2. The proposed Scale-Aligned Reference module is conceptually well-motivated, solving the scale-mismatch issue in a resource-efficient way by using scale-aligned source features solely in the first attention layer. The supporting evidence includes explicit self-attention heatmaps and precis

1. The Related Work underappreciates or omits explicit discussion of closely related contemporaneous models for instruction-based or in-context image editing. [1,2,3] are especially relevant and directly relate to the manuscript’s focus and should be discussed in both the Related Work section and empirical comparisons if possible.
2. The approach is mostly evaluated in the standard fine-tuned regime and does not examine how VAREdit generalizes to unseen editing types, tasks, or user instruction

- The paper introduces a novel method called Scale-Aligned Reference (SAR), which effectively balances computational efficiency and editing performance.
- The proposed method achieves strong performance on EMU-Edit and PIE-Bench, with moderate latency, outperforming several diffusion-based and autoregressive (AR) baselines.
- The paper includes comprehensive experimental analysis; it conducts attention-level investigations to motivate the design of SAR, provides both qualitative and quantitativ

- The paper provides self-attention heatmaps based on the full-scale setting to motivate the design of SAR, but a similar analysis of the tuned model is missing; including this will cross-validate the modeling choice and strengthen the motivation.
- The paper relies on GPT score to evaluate editing models; however, GPT judges may introduce hallucinations and prompt-induced biases. A better way would be to use a controlled human study to cross-validate the reliability of using GPT as the evaluato
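The trade-off the reviews describe between full-scale and finest-scale-only conditioning, and the idea of scale-aligned source features, can be made concrete with a small sketch. Everything here is an assumption for illustration: the schedule `SCALES` is hypothetical, "full-scale" is read as prepending every source scale and "finest-scale-only" as prepending just the last one, and average pooling is one plausible way to obtain a source reference at each target resolution. This page does not state how the paper actually computes these features.

```python
SCALES = (1, 2, 4, 8)  # hypothetical schedule: side length per scale


def prefix_tokens_full(scales):
    """Source tokens prepended when every source scale is used as context."""
    return sum(s * s for s in scales)


def prefix_tokens_finest(scales):
    """Source tokens prepended when only the finest source scale is used."""
    return scales[-1] ** 2


def average_pool(grid, side):
    """Downsample a square grid to side x side by block averaging.

    Assumes len(grid) is divisible by side.
    """
    block = len(grid) // side
    out = []
    for i in range(side):
        row = []
        for j in range(side):
            cells = [grid[i * block + di][j * block + dj]
                     for di in range(block) for dj in range(block)]
            row.append(sum(cells) / len(cells))
        out.append(row)
    return out


# Toy finest-scale source features: an 8 x 8 grid of scalars.
finest = [[float(r * 8 + c) for c in range(8)] for r in range(8)]

# One source reference per target scale, matched in resolution (assumed pooling).
aligned = {s: average_pool(finest, s) for s in SCALES}

print(prefix_tokens_full(SCALES), prefix_tokens_finest(SCALES))  # 85 64
print(aligned[1])  # [[31.5]]
```

Under these assumptions, the finest-only prefix is shorter (64 tokens against 85), but a coarse target scale then has only fine-grained source tokens to attend to. The pooled `aligned` grids give each target scale a source reference at its own resolution.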
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Cell Image Analysis Techniques · Digital Humanities and Scholarship
