A$^2$-Edit: Precise Reference-Guided Image Editing of Arbitrary Objects and Ambiguous Masks

Huayu Zheng; Guangzhao Li; Baixuan Zhao; Siqi Luo; Hantao Jiang; Guangtao Zhai; Xiaohong Liu

arXiv:2603.10685·cs.CV·March 23, 2026

TL;DR

A$^2$-Edit is a reference-guided image editing framework that replaces any target region with a reference object of an arbitrary category, using only a coarse mask. It is supported by UniEdit-500K, a multi-category dataset of 500,104 image pairs, and by a Mixture of Transformer module.

Contribution

The paper introduces a unified inpainting model with a Mixture of Transformer module and a Mask Annealing Training Strategy, targeting the challenges of diverse object categories and imprecise masks.

Findings

1. Outperforms existing methods on benchmarks like VITON-HD and AnyInsertion.

2. Demonstrates robust editing across diverse object categories.

3. Enhances semantic transfer and generalization in image editing.

Abstract

We propose A$^2$-Edit, a unified inpainting framework for arbitrary object categories, which allows users to replace any target region with a reference object using only a coarse mask. To address the issues of severe homogenization and limited category coverage in existing datasets, we construct a large-scale multi-category dataset, UniEdit-500K, which includes 8 major categories, 209 fine-grained subcategories, and a total of 500,104 image pairs. Such rich category diversity poses new challenges for the model, requiring it to automatically learn semantic relationships and distinctions across categories. To this end, we introduce the Mixture of Transformer module, which performs differentiated modeling of various object categories through dynamic expert selection, and further enhances cross-category semantic transfer and generalization through collaboration among experts. In addition, we…
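
The abstract does not say how the Mixture of Transformer module selects or combines experts. As a rough illustration of what "dynamic expert selection" and "collaboration among experts" can look like, here is a minimal sketch of generic top-k softmax gating, a common mixture-of-experts routing scheme. It is not the authors' implementation: `route_token`, the gate scores, and the toy experts are all hypothetical names and values chosen for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(gate_scores, experts, token, top_k=2):
    """Send one token to its top_k experts and blend their outputs.

    gate_scores: one raw score per expert (hypothetical router output).
    experts: list of callables, each mapping a token vector to a vector.
    """
    probs = softmax(gate_scores)
    # Dynamic expert selection: keep the k experts with the highest gate probability.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalize so the selected weights sum to 1.
    norm = sum(probs[i] for i in chosen)
    weights = {i: probs[i] / norm for i in chosen}
    # Collaboration among experts: weighted sum of the selected experts' outputs.
    output = [0.0] * len(token)
    for i, w in weights.items():
        expert_out = experts[i](token)
        output = [o + w * e for o, e in zip(output, expert_out)]
    return output, weights


# Toy experts (assumed for illustration): each scales the token by a different factor.
experts = [
    lambda t: [1.0 * x for x in t],
    lambda t: [2.0 * x for x in t],
    lambda t: [3.0 * x for x in t],
]
output, weights = route_token([0.0, 0.0, -10.0], experts, [1.0, 1.0], top_k=2)
```

In this toy run the third expert receives a very low gate score, so only the first two experts are selected and their outputs are blended with equal weight. In a real model the gate scores would come from a learned router and the experts would be transformer sub-blocks rather than scalar multipliers.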

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Cell Image Analysis Techniques