Rethinking Structure Preservation in Text-Guided Image Editing with Visual Autoregressive Models

Tao Xia; Jiawei Liu; Yukun Zhang; Ting Liu; Wei Wang; Lei Zhang

arXiv:2603.28367 · cs.CV · March 31, 2026

TL;DR

This paper introduces a text-guided image editing framework built on visual autoregressive (VAR) models that improves structural consistency and editing fidelity through coarse-to-fine token localization and adaptive feature injection.

Contribution

The work improves structure preservation and editing accuracy in VAR-based image editing by analyzing intermediate feature distributions and employing reinforcement learning for feature injection.

Findings

01. Achieves superior structural consistency compared to state-of-the-art methods.

02. Balances editing fidelity and background preservation effectively.

03. Demonstrates improved results in both local and global editing scenarios.

Abstract

Visual autoregressive (VAR) models have recently emerged as a promising family of generative models, enabling a wide range of downstream vision tasks such as text-guided image editing. By shifting the editing paradigm from noise manipulation in diffusion-based methods to token-level operations, VAR-based approaches achieve better background preservation and significantly faster inference. However, existing VAR-based editing methods still face two key challenges: accurately localizing editable tokens and maintaining structural consistency in the edited results. In this work, we propose a novel text-guided image editing framework rooted in an analysis of intermediate feature distributions within VAR models. First, we introduce a coarse-to-fine token localization strategy that can refine editable regions, balancing editing fidelity and background preservation. Second, we analyze the…
