Group Relative Attention Guidance for Image Editing

Xuanpu Zhang; Xuesong Niu; Ruidong Chen; Dan Song; Jianhao Zeng; Penghui Du; Haoxiang Cao; Kai Wu; An-an Liu

arXiv:2510.24657·cs.CV·December 1, 2025

3 Reviews

TL;DR

This paper introduces Group Relative Attention Guidance (GRAG), a novel method that enables continuous, fine-grained control over image editing intensity in diffusion-based models, improving editing quality and precision.

Contribution

We propose GRAG, a simple technique that reweights token delta values to modulate editing focus, allowing precise control without additional tuning.

Findings

01

GRAG can be integrated with minimal code changes.

02

It enhances editing quality across various frameworks.

03

It provides smoother, more precise control than Classifier-Free Guidance.

Abstract

Recently, image editing based on Diffusion-in-Transformer models has undergone rapid development. However, existing editing methods often lack effective control over the degree of editing, limiting their ability to achieve more customized results. To address this limitation, we investigate the MM-Attention mechanism within the DiT model and observe that the Query and Key tokens share a bias vector that is only layer-dependent. We interpret this bias as representing the model's inherent editing behavior, while the delta between each token and its corresponding bias encodes the content-specific editing signals. Based on this insight, we propose Group Relative Attention Guidance, a simple yet effective method that reweights the delta values of different tokens to modulate the focus of the model on the input image relative to the editing instruction, enabling continuous and fine-grained…
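The decomposition described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes the shared bias can be approximated by the mean of a token group, and the function names (`group_mean`, `reweight_group`, `attention_weights`) and parameters (`bias_scale`, `delta_scale`) are hypothetical. Each key is split into a bias plus a per-token delta, the two parts are rescaled independently, and the change in attention mass on the image tokens is printed.

```python
import math


def group_mean(tokens):
    """Per-dimension mean of a token group (assumed stand-in for the shared bias)."""
    n = len(tokens)
    return [sum(dim) / n for dim in zip(*tokens)]


def reweight_group(tokens, bias_scale=1.0, delta_scale=1.0):
    """Rescale the bias and the per-token delta of each token independently.

    Hypothetical form: new_token = bias_scale * bias + delta_scale * (token - bias).
    """
    bias = group_mean(tokens)
    return [
        [bias_scale * b + delta_scale * (x - b) for x, b in zip(token, bias)]
        for token in tokens
    ]


def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over a list of keys."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy data: two "instruction" keys and three "source image" keys (made up).
text_keys = [[0.2, 1.0, -0.5], [0.1, 0.8, -0.4]]
image_keys = [[2.0, 0.1, 0.3], [2.2, -0.2, 0.5], [1.8, 0.4, 0.1]]
query = [1.0, 0.5, 0.2]

for delta_scale in (0.8, 1.0, 1.2):
    # Only the image-token group is reweighted; text keys are left untouched.
    keys = text_keys + reweight_group(image_keys, delta_scale=delta_scale)
    weights = attention_weights(query, keys)
    image_mass = sum(weights[len(text_keys):])
    print(f"delta_scale={delta_scale:.1f}  image attention mass={image_mass:.3f}")
```

With both scales set to 1.0 the keys are unchanged, so the sketch reduces to ordinary attention. Because the two scales are continuous, the adjustment can be varied smoothly rather than switched on or off.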

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The method is simple and training-free, and is supported by solid qualitative and quantitative evaluations across multiple editors, including an ablation study.
2. The method consistently produces more accurate editing results, effectively enhancing the performance of existing base models.
3. It effectively addresses the key trade-off between instruction fidelity and source preservation.

Weaknesses

1. The comparison is narrow. The paper's primary comparator is CFG, but it overlooks other relevant, training-free editing controls like attention reweighting, PnP Inversion, attention injection, and token-level gating. A broader comparison is essential to substantiate the claim of "uniquely precise and continuous control".
2. The guidance coefficient λ is treated as a singular, static parameter with no exploration of task, dataset, or timestep-dependent tuning. Furthermore, no formal link is es

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- Applying GRAG across various benchmarks shows that, while some trade-offs exist, the improvement in editing performance is both noticeable and acceptable. Qualitative results further demonstrate that GRAG effectively follows the given editing instructions while maintaining the source image.
- The ability to control the degree of editing by adjusting the GRAG scale highlights its high applicability. In particular, as shown in Table 2, GRAG offers more diverse and fine-grained control over edit

Weaknesses

- The paper identifies a significant bias vector among token embeddings and demonstrates that modulating each token’s deviation from this bias can control editing strength. However, the assumption stated in lines 249–252 (that the bias vector represents a fixed “editing action” during the image editing process, while the variations of individual tokens relative to this bias correspond to the “content” being edited) lacks a concrete theoretical justification and is only supported empirically.
- I

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. **Novel Insight and Simplicity:** The paper's primary strength lies in its novel observation and interpretation of the "bias vector" within DiT attention layers. This provides a new perspective on the internal mechanics of these models. The resulting method, GRAG, is elegant in its simplicity, requiring only a few lines of code to implement, which significantly lowers the barrier to adoption.
2. **Effective and Fine-Grained Control:** The qualitative and quantitative results (especially Fi

Weaknesses

1. **Indirect Manipulation of Attention:** The method manipulates key embeddings *before* the attention score calculation. An arguably more direct approach to control content contribution would be to modulate the attention weights themselves (i.e., the output of the softmax operation, or the logits before it). The paper does not provide a justification for why modulating the pre-attention embeddings is a superior or more principled choice compared to these more direct alternatives.
2. **Inter
