AttnRouter: Per-Category Attention Routing for Training-Free Image Editing on MMDiT
Guandong Li, Mengxia Ye

TL;DR
This paper studies training-free image editing on Qwen-Image-Edit-2511, a 60-block multi-modal diffusion transformer (MMDiT). It introduces a single-forward attention manipulation (KVInject) and a per-category routing table (AttnRouter) that aim to improve edit quality while preserving source structure.
Contribution
It proposes KVInject for simplified attention manipulation and AttnRouter for per-category routing, and it localizes the attention sub-circuits that are effective for image editing.
Findings
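KVInject needs only a single forward pass and avoids the prompt-mismatch failure mode that disables the two-pass MasaCtrl recipe on MMDiT (where the composite score drops 31% versus baseline).
AttnRouter improves the CLIP-T+DINO-I composite by dispatching each edit category to the attention operation that best preserves source structure for that type.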
Injection in early denoising steps recovers most editing gains.
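The per-category dispatch described in the findings can be pictured as a lookup table built from per-operation scores. The sketch below is only an illustration under stated assumptions: the category labels, operation names, score values, and the helper names `build_routing_table` and `route` are all hypothetical and do not come from the paper.

```python
from typing import Dict


def build_routing_table(scores: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """For each edit category, keep the operation with the highest score.

    `scores[category][operation]` is assumed to be some structure-preservation
    score measured offline; how the paper scores operations is not shown here.
    """
    return {
        category: max(per_op, key=per_op.get)
        for category, per_op in scores.items()
    }


def route(category: str, table: Dict[str, str], default: str = "baseline") -> str:
    """Look up the operation for a category; fall back for unseen categories."""
    return table.get(category, default)


# Toy placeholder scores, NOT results from the paper.
toy_scores = {
    "category_a": {"baseline": 0.50, "kv_inject": 0.70},
    "category_b": {"baseline": 0.60, "kv_inject": 0.40},
}
table = build_routing_table(toy_scores)
```

With these placeholder numbers, `route("category_a", table)` selects `kv_inject`, while a category missing from the table falls back to the default operation.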
Abstract
We study training-free image editing on Qwen-Image-Edit-2511, a 60-block multi-modal diffusion transformer (MMDiT) that concatenates noise and source-image tokens within a single attention stream. We make three contributions. (i) We introduce KVInject, a single-forward attention manipulation that alpha-blends source-half key/value projections into the noise-half within a localized layer/step band. KVInject is simpler than the classical two-pass MasaCtrl recipe and avoids the prompt-mismatch failure mode that disables MasaCtrl on MMDiT (composite score drops 31% versus baseline). (ii) We show that no single attention operation dominates across edit types, motivating AttnRouter, a per-category routing table that dispatches edits to the operation that best preserves source structure for that type. With ground-truth categories the router improves the CLIP-T+DINO-I composite by 6.4% over the…
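As a rough illustration of the KVInject idea described in the abstract (alpha-blending source-half key/value projections into the noise-half, only inside a layer/step band), a minimal pure-Python sketch follows. It rests on several assumptions that the abstract does not confirm: noise tokens come first in the sequence, the source half has the same number of tokens and is aligned one-to-one with the noise half, and the band bounds and alpha value are arbitrary placeholders. The names `InjectBand`, `kv_inject`, and `maybe_inject` are hypothetical.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class InjectBand:
    """Where and how strongly to inject. All values are placeholders."""
    layers: range   # transformer blocks in which injection is active
    steps: range    # denoising steps in which injection is active
    alpha: float    # blend weight given to the source-half projection


def kv_inject(kv: List[Vector], n_noise: int, alpha: float) -> List[Vector]:
    """Blend source-half rows into noise-half rows of a K or V matrix.

    `kv` has one row per token. Assumed layout: the first `n_noise` rows are
    noise tokens and the next `n_noise` rows are the aligned source tokens.
    Rows from index `n_noise` onward are returned unchanged.
    """
    noise, source = kv[:n_noise], kv[n_noise:2 * n_noise]
    blended = [
        [(1.0 - alpha) * n + alpha * s for n, s in zip(n_row, s_row)]
        for n_row, s_row in zip(noise, source)
    ]
    return blended + [row[:] for row in kv[n_noise:]]


def maybe_inject(kv: List[Vector], n_noise: int, layer: int, step: int,
                 band: InjectBand) -> List[Vector]:
    """Apply the blend only inside the configured layer/step band."""
    if layer in band.layers and step in band.steps:
        return kv_inject(kv, n_noise, band.alpha)
    return kv
```

Because the blend happens on the projections inside one attention stream, this sketch needs no second forward pass on the source image, which is the property the abstract contrasts with the two-pass MasaCtrl recipe.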
