TL;DR
This paper introduces Group Relative Attention Guidance (GRAG), a novel method that enables continuous, fine-grained control over image editing intensity in diffusion-based models, improving editing quality and precision.
Contribution
We propose GRAG, a simple technique that reweights each token's delta from a shared, layer-dependent bias vector to shift the model's focus between the input image and the editing instruction, allowing precise control without additional tuning.
Findings
GRAG can be integrated with minimal code changes.
It enhances editing quality across various frameworks.
It provides smoother, more precise control than Classifier-Free Guidance.
Abstract
Recently, image editing based on Diffusion-in-Transformer models has undergone rapid development. However, existing editing methods often lack effective control over the degree of editing, limiting their ability to achieve more customized results. To address this limitation, we investigate the MM-Attention mechanism within the DiT model and observe that the Query and Key tokens share a bias vector that is only layer-dependent. We interpret this bias as representing the model's inherent editing behavior, while the delta between each token and its corresponding bias encodes the content-specific editing signals. Based on this insight, we propose Group Relative Attention Guidance, a simple yet effective method that reweights the delta values of different tokens to modulate the focus of the model on the input image relative to the editing instruction, enabling continuous and fine-grained…
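The reweighting idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes that the shared bias can be approximated by the mean of a token group, and that the reweighting is applied to the key embeddings of the source-image tokens before the attention scores are computed. The names `reweight_group`, `delta_scale`, and `bias_scale` are hypothetical, and the data is random toy data.

```python
import numpy as np

def reweight_group(tokens, delta_scale, bias_scale=1.0):
    """Rescale each token's deviation from a shared bias vector.

    tokens: array of shape (num_tokens, dim).
    Assumption: the group mean stands in for the shared bias.
    """
    bias = tokens.mean(axis=0, keepdims=True)   # (1, dim) shared component
    delta = tokens - bias                       # per-token, content-specific part
    return bias_scale * bias + delta_scale * delta

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)     # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy data: image keys share a large common offset, text keys do not.
rng = np.random.default_rng(0)
dim = 8
shared_bias = 3.0 * rng.normal(size=(1, dim))
text_keys = rng.normal(size=(4, dim))
image_keys = shared_bias + rng.normal(size=(6, dim))
queries = rng.normal(size=(5, dim))

def attention_weights(delta_scale):
    """Attention over [text keys, reweighted image keys] for the toy queries."""
    keys = np.concatenate(
        [text_keys, reweight_group(image_keys, delta_scale)], axis=0
    )
    logits = queries @ keys.T / np.sqrt(dim)
    return softmax(logits, axis=-1)

identity = reweight_group(image_keys, 1.0)      # scale 1 leaves the keys unchanged
collapsed = reweight_group(image_keys, 0.0)     # scale 0 keeps only the shared bias
weights = attention_weights(1.2)                # scale > 1 amplifies the deltas
```

In this sketch `delta_scale` is a continuous knob: a value of 1 reproduces the original keys, and values above or below 1 amplify or dampen the per-token deltas while the group mean stays fixed. How the paper parameterizes and applies the scale may differ.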
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The method is simple and training-free, and is supported by solid qualitative and quantitative evaluations across multiple editors, including an ablation study. 2. The method consistently produces more accurate editing results, effectively enhancing the performance of existing base models. 3. It effectively addresses the key trade-off between instruction fidelity and source preservation.
1. The comparison is narrow. The paper's primary comparator is CFG, but it overlooks other relevant, training-free editing controls like attention reweighting, PnP Inversion, attention injection, and token-level gating. A broader comparison is essential to substantiate the claim of "uniquely precise and continuous control". 2. The guidance coefficient λ is treated as a singular, static parameter with no exploration of task, dataset, or timestep-dependent tuning. Furthermore, no formal link is es
- Applying GRAG across various benchmarks shows that, while some trade-offs exist, the improvement in editing performance is both noticeable and acceptable. Qualitative results further demonstrate that GRAG effectively follows the given editing instructions while maintaining the source image. - The ability to control the degree of editing by adjusting the GRAG scale highlights its high applicability. In particular, as shown in Table 2, GRAG offers more diverse and fine-grained control over edit
- While the paper identifies a significant bias vector among token embeddings and demonstrates that modulating each token’s deviation from this bias can control editing strength, the assumption stated in lines 249–252 — that the bias vector represents a fixed “editing action” during the image editing process, while the variations of individual tokens relative to this bias correspond to the “content” being edited — lacks a concrete theoretical justification and is only supported empirically. - I
1. **Novel Insight and Simplicity:** The paper's primary strength lies in its novel observation and interpretation of the "bias vector" within DiT attention layers. This provides a new perspective on the internal mechanics of these models. The resulting method, GRAG, is elegant in its simplicity, requiring only a few lines of code to implement, which significantly lowers the barrier to adoption. 2. **Effective and Fine-Grained Control:** The qualitative and quantitative results (especially Fi
1. **Indirect Manipulation of Attention:** The method manipulates key embeddings *before* the attention score calculation. An arguably more direct approach to control content contribution would be to modulate the attention weights themselves (i.e., the output of the softmax operation, or the logits before it). The paper does not provide a justification for why modulating the pre-attention embeddings is a superior or more principled choice compared to these more direct alternatives. 2. **Inter
