Edit Where You Mean: Region-Aware Adapter Injection for Mask-Free Local Image Editing

Honghao Cai; Xiangyuan Wang; Yunhao Bai; Haohua Chen; Tianze Zhou; Runqi Wang; Wei Zhu; Yibo Chen; Xu Tang; Yao Hu; Zhen Li

arXiv:2604.23763·cs.CV·May 1, 2026

TL;DR

AdaptEdit is a framework for precise, mask-free local image editing with diffusion transformers. It combines region-aware adapters with a region-focused training objective.

Contribution

The paper introduces a region-aware adapter framework that enables mask-free, localized editing in diffusion transformers without modifying the backbone weights.

Findings

01 Achieves state-of-the-art results on MagicBrush and Emu-Edit Test benchmarks.

02 Outperforms both mask-free and oracle-mask baselines in local image editing.

03 Component ablation confirms the effectiveness of each part of the framework.

Abstract

Large diffusion transformers (DiTs) follow global editing instructions well but consistently leak local edits into unrelated regions, because joint-attention architectures offer no explicit channel telling the network where to apply the edit. We introduce AdaptEdit, a co-trained, instruction- and region-aware adapter framework that retrofits a frozen DiT into a precise local editor without modifying its backbone weights. A lightweight Block Adapter at every transformer block injects a structured condition stream that factorizes what to edit (instruction semantics) from where to edit (spatial mask); a learned SpatialGate routes the adapter signal selectively into the edit region while keeping the rest of the image near-identical to the source; and a Region-Aware Loss focuses the training objective on the changing pixels. Because these components make the backbone's internal…
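The abstract does not give formulas for the SpatialGate or the Region-Aware Loss, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes the gate acts as a per-token scalar in [0, 1] that scales a residual adapter update, and that the loss is a mean squared error re-weighted toward pixels where the target differs from the source. The function names, the `outside_weight` and `threshold` parameters, and both formulas are hypothetical.

```python
from typing import List

Token = List[float]


def gated_adapter_update(hidden: List[Token],
                         adapter_out: List[Token],
                         gate: List[float]) -> List[Token]:
    """Assumed residual injection: h_i + g_i * a_i for each token i.

    `gate` holds one scalar per token. In the paper the gate is learned;
    here it is simply passed in so the routing effect is visible.
    """
    return [
        [h + g * a for h, a in zip(h_tok, a_tok)]
        for h_tok, a_tok, g in zip(hidden, adapter_out, gate)
    ]


def region_aware_mse(pred: List[float],
                     target: List[float],
                     source: List[float],
                     outside_weight: float = 0.1,
                     threshold: float = 1e-3) -> float:
    """Assumed region-weighted MSE over flattened pixel values.

    Pixels where the target differs from the source by more than
    `threshold` get weight 1.0; all other pixels get `outside_weight`.
    """
    weights = [
        1.0 if abs(t - s) > threshold else outside_weight
        for t, s in zip(target, source)
    ]
    weighted_err = sum(w * (p - t) ** 2
                       for w, p, t in zip(weights, pred, target))
    return weighted_err / sum(weights)


if __name__ == "__main__":
    # Token 0 lies outside the edit region (gate 0), token 1 inside (gate 1).
    hidden = [[1.0, 2.0], [3.0, 4.0]]
    adapter_out = [[0.5, 0.5], [0.5, 0.5]]
    print(gated_adapter_update(hidden, adapter_out, gate=[0.0, 1.0]))

    # The last two pixels change between source and target.
    source = [0.0, 0.0, 0.0, 0.0]
    target = [0.0, 0.0, 1.0, 1.0]
    print(region_aware_mse([0.5, 0.0, 1.0, 0.5], target, source))
```

With a gate of 0 the token passes through unchanged, which mirrors the abstract's goal of keeping unedited regions near-identical to the source. The weighted loss penalizes an error inside the changed region more heavily than an equal error outside it.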
