EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing
Wei Chow, Linfeng Li, Lingdong Kong, Zefeng Li, Qi Xu, Hang Song, Tian Ye, Xian Wang, Jinbin Bai, Shilin Xu, Xiangtai Li, Junting Pan, Shaoteng Liu, Ran Zhou, Tianshu Yang, Songhua Liu

TL;DR
This paper introduces EditMGT, a novel image editing framework based on Masked Generative Transformers, which enables precise local edits while preserving non-target regions, outperforming diffusion models in quality and speed.
Contribution
The paper presents the first MGT-based image editing framework, utilizing attention maps for localization and a new sampling method to confine edits, with a high-resolution dataset for training.
Findings
Achieves 6x faster editing than diffusion models
Improves style change accuracy by 3.6%
Enhances style transfer performance by 17.6%
Abstract
Recent advances in diffusion models (DMs) have achieved exceptional visual quality in image editing tasks. However, the global denoising dynamics of DMs inherently conflate local editing targets with the full-image context, leading to unintended modifications in non-target regions. In this paper, we shift our attention beyond DMs and turn to Masked Generative Transformers (MGTs) as an alternative approach to tackle this challenge. By predicting multiple masked tokens rather than holistic refinement, MGTs exhibit a localized decoding paradigm that endows them with the inherent capacity to explicitly preserve non-relevant regions during the editing process. Building upon this insight, we introduce the first MGT-based image editing framework, termed EditMGT. We first demonstrate that MGT's cross-attention maps provide informative localization signals for localizing edit-relevant regions…
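To make the localized-decoding idea in the abstract concrete, here is a minimal sketch of how an attention-derived mask could confine an edit to target tokens. It is an illustration under stated assumptions, not the authors' implementation: the function names `localize_edit_region` and `hold_regions`, the min-max normalization, and the 0.5 threshold are all hypothetical, and the abstract does not specify how the attention maps are processed or how sampling is confined.

```python
from typing import List, Sequence


def localize_edit_region(attn: Sequence[float], threshold: float = 0.5) -> List[bool]:
    """Turn per-token attention scores into a boolean edit mask.

    Assumption: scores are min-max normalized and thresholded; the paper's
    actual procedure may differ.
    """
    lo, hi = min(attn), max(attn)
    span = hi - lo
    if span == 0:
        # Flat attention carries no localization signal: edit nothing.
        return [False] * len(attn)
    return [(a - lo) / span >= threshold for a in attn]


def hold_regions(
    source_tokens: Sequence[int],
    predicted_tokens: Sequence[int],
    edit_mask: Sequence[bool],
) -> List[int]:
    """Keep source tokens outside the mask, take predicted tokens inside it."""
    if not (len(source_tokens) == len(predicted_tokens) == len(edit_mask)):
        raise ValueError("all inputs must have the same length")
    return [p if m else s for s, p, m in zip(source_tokens, predicted_tokens, edit_mask)]


# Toy example: six image tokens, with attention peaking on tokens 2 and 3.
source = [11, 12, 13, 14, 15, 16]
predicted = [91, 92, 93, 94, 95, 96]
attention = [0.05, 0.10, 0.80, 0.90, 0.15, 0.05]

mask = localize_edit_region(attention)
edited = hold_regions(source, predicted, mask)
print(mask)    # [False, False, True, True, False, False]
print(edited)  # [11, 12, 93, 94, 15, 16]
```

In this toy setting, tokens outside the mask are copied from the source unchanged. A diffusion-style update, by contrast, refines every position at each step.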
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
+ The proposal takes a text-to-image masked generative transformer and repurposes it for the task of instruction-based image editing. The guidelines and best practices identified in the paper are likely reusable when/if new and better masked generative transformers become available.
+ The result of this work is an image editing model that is significantly cheaper to run and requires a fraction of the memory compared to most competitors. It achieves this without sacrificing too much the quality of
## Major
a. **Entanglement of contribution/missing ablations**: the paper has two main contributions: a novel dataset for instruction-based image editing and a method to turn masked generative transformers into image editing models. When the two are combined, the paper shows that they yield a model which is small but competitive at the task. However, it is not clear what the relative contributions of data vs. architecture are with respect to this achievement. In particular the paper would have be
This paper employs masked generative transformers (MGTs) for image editing. Because MGTs process local patches during image generation, they can better preserve unedited regions in image editing tasks. The idea is logical, and the framework is pioneering. Experimental results also demonstrate the potential of this method: it achieves state-of-the-art performance comparable to large models while requiring far fewer parameters.
- The absence of an ablation study makes it unclear how much the proposed inference-time techniques, i.e., attention consolidation and region-hold sampling, contribute to efficiency and final metrics. I am concerned that these methods may introduce additional computational overhead.
- There is an issue of fairness in the comparative experiments. The paper mentions the use of CrispEdit-2M, a specially collected high-quality and high-resolution dataset, which may have significantly contributed t
- The paper presents an innovative use of Masked Generative Transformers for image editing, offering a fresh alternative to diffusion models.
- The attention injection mechanism enables efficient conditioning without additional parameters, making the method lightweight and practical.
- The combination of attention consolidation and region-hold sampling provides precise edit localization and strong preservation of unedited areas.
- Experiments across several benchmarks show that EditMGT delivers
- The paper lacks detailed ablation studies that isolate the effect of each component on the final performance.
- The robustness of the approach is not deeply analyzed, especially for challenging cases where localization may fail, such as small or overlapping objects, fine-grained texture edits (e.g., 'add a small cat logo on the cup'), or ambiguous text prompts (e.g., 'make it blue' in a scene with two or three objects).
- Some quantitative metrics, such as L1, show inconsistent
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Digital Humanities and Scholarship · Computer Graphics and Visualization Techniques
