TL;DR
This paper introduces Masked Logit Nudging, a novel prompt-guided image editing method for visual autoregressive models that improves editing accuracy, reconstruction quality, and speed compared to previous approaches.
Contribution
The paper proposes Masked Logit Nudging, a guidance technique that aligns the model's predictions under the target prompt with the source image tokens, enabling precise, efficient image editing and reconstruction.
Findings
Achieves state-of-the-art performance on the PIE benchmark.
Outperforms previous VAR-based methods and rivals diffusion models in quality.
Provides faster image editing and reconstruction than existing methods.
Abstract
We address the problem of prompt-guided image editing in visual autoregressive models. Given a source image and a target text prompt, we aim to modify the source image according to the target prompt, while preserving all regions which are unrelated to the requested edit. To this end, we present Masked Logit Nudging, which uses the source image token maps to introduce a guidance step that aligns the model's predictions under the target prompt with these source token maps. Specifically, we convert the fixed source encodings into logits using the VAR encoding, nudging the model's predicted logits towards the targets along a semantic trajectory defined by the source-target prompts. Edits are applied only within spatial masks obtained through a dedicated masking scheme that leverages cross-attention differences between the source and edited prompts. Then, we introduce a refinement to correct…
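The guidance step described in the abstract can be illustrated with a small, hedged sketch. It is not the authors' implementation: the function names (`source_logits`, `edit_mask`, `masked_nudge`), the `peak`, `threshold`, and `alpha` parameters, the one-hot conversion of source tokens into logits, and the plain linear interpolation are all assumptions made for illustration. In particular, the sketch replaces the paper's "semantic trajectory" with simple interpolation, and it assumes that positions inside the edit mask keep the model's own prediction while positions outside it are nudged toward the source tokens.

```python
def source_logits(src_tokens, vocab_size, peak=10.0):
    """Turn fixed source token ids into peaked logits (assumed one-hot style conversion)."""
    return [[peak if v == tok else 0.0 for v in range(vocab_size)]
            for tok in src_tokens]


def edit_mask(attn_src, attn_tgt, threshold=0.3):
    """Mark a position as editable when cross-attention differs strongly between prompts.

    attn_src / attn_tgt are per-position attention scores (hypothetical inputs).
    """
    return [abs(t - s) > threshold for s, t in zip(attn_src, attn_tgt)]


def masked_nudge(pred_logits, src_tokens, mask, alpha=0.8):
    """Nudge predicted logits toward the source logits outside the edit mask.

    Inside the mask the model's prediction is left untouched, so the edit can
    take effect; outside it, linear interpolation pulls the prediction toward
    the source token (a stand-in for the paper's nudging rule).
    """
    vocab_size = len(pred_logits[0])
    targets = source_logits(src_tokens, vocab_size)
    nudged = []
    for pred, tgt, editable in zip(pred_logits, targets, mask):
        if editable:
            nudged.append(list(pred))
        else:
            nudged.append([p + alpha * (t - p) for p, t in zip(pred, tgt)])
    return nudged


def argmax(row):
    """Index of the largest logit in a row."""
    return max(range(len(row)), key=row.__getitem__)
```

A usage example under the same assumptions: with four token positions, the model under the target prompt prefers token 3 everywhere, but only the first position shows a large cross-attention change. Calling `masked_nudge` then leaves token 3 at that position and restores the source tokens elsewhere.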
