Semantic Context Matters: Improving Conditioning for Autoregressive Models

Dongyang Jin; Ryan Xu; Jianhao Zeng; Rui Lan; Yancheng Bai; Lei Sun; Xiangxiang Chu

arXiv:2511.14063 · cs.CV · March 17, 2026

TL;DR

SCAR enhances autoregressive image models by incorporating semantic context through a compact prefix and alignment guidance, significantly improving instruction adherence and visual quality in image editing tasks.

Contribution

Introduces SCAR, a semantic-context-driven conditioning method for autoregressive image editing that targets the weak semantic conditioning and high computational cost of prior approaches.

Findings

01. Outperforms prior AR-based methods in visual fidelity and semantic alignment

02. Achieves superior results on instruction editing and controllable generation benchmarks

03. Maintains flexibility across different autoregressive paradigms

Abstract

Recently, autoregressive (AR) models have shown strong potential in image generation, offering better scalability and easier integration with unified multi-modal systems compared to diffusion-based methods. However, extending AR models to general image editing remains challenging due to weak and inefficient conditioning, often leading to poor instruction adherence and visual artifacts. To address this, we propose SCAR, a Semantic-Context-driven method for Autoregressive models. SCAR introduces two key components: Compressed Semantic Prefilling, which encodes high-level semantics into a compact and efficient prefix, and Semantic Alignment Guidance, which aligns the last visual hidden states with target semantics during autoregressive decoding to enhance instruction fidelity. Unlike decoding-stage injection methods, SCAR builds upon the flexibility and generality of vector-quantized-based…
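
The two components named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how the prefix is compressed or which alignment objective is used, so the mean pooling and the cosine-based loss below are assumptions chosen for simplicity, and the function names `compress_prefix` and `alignment_loss` are hypothetical.

```python
import math


def compress_prefix(features, prefix_len):
    """Pool a sequence of semantic feature vectors into a shorter prefix.

    Assumption: mean pooling over contiguous chunks stands in for whatever
    compression the paper actually uses.
    """
    n = len(features)
    if not 1 <= prefix_len <= n:
        raise ValueError("prefix_len must be between 1 and len(features)")
    dim = len(features[0])
    prefix = []
    for k in range(prefix_len):
        # Contiguous, non-empty chunks that together cover all n vectors.
        start = k * n // prefix_len
        end = (k + 1) * n // prefix_len
        chunk = features[start:end]
        prefix.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return prefix


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def alignment_loss(hidden_states, target_semantics):
    """Mean (1 - cosine similarity) over paired vectors.

    Assumption: a cosine objective is used here only as a common way to
    align hidden states with target features.
    """
    if len(hidden_states) != len(target_semantics):
        raise ValueError("inputs must have the same length")
    total = sum(
        1.0 - cosine_similarity(h, t)
        for h, t in zip(hidden_states, target_semantics)
    )
    return total / len(hidden_states)


# Toy usage: four 2-d semantic vectors compressed into a 2-token prefix.
features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
prefix = compress_prefix(features, prefix_len=2)
loss = alignment_loss(prefix, [[1.0, 0.0], [0.0, 1.0]])
```

In this toy setting the prefix is half the length of the input, and the loss is zero when each vector points in the same direction as its target and grows as they diverge.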

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning