TL;DR
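ExtraVAR combines stage-aware RoPE remapping with adaptive attention calibration so that visual autoregressive (VAR) models can generate images above their training resolution without retraining. It targets three failure modes (global repetition, local repetition, and detail degradation) that the paper attributes to a band-stage mismatch in VAR's scale-wise generation.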
Contribution
The paper proposes a training-free method for resolution extrapolation in VAR models. It assigns each RoPE frequency band a stage-specific remapping rule to suppress global repetition, local repetition, and detail degradation, and it adaptively calibrates attention dispersion.
Findings
Outperforms prior methods in structural coherence.
Enhances fine-detail fidelity at higher resolutions.
Effectively suppresses repetition and detail degradation.
Abstract
Visual Autoregressive (VAR) models have emerged as a strong alternative to diffusion for image synthesis, yet their fixed training resolution prevents direct generation at higher resolutions. Naively transferring training-free extrapolation methods from LLMs or diffusion models to VAR yields three characteristic failure modes: global repetition, local repetition, and detail degradation. We trace them to a unified band-stage mismatch: VAR generates images in a coarse-to-fine, scale-wise process where each stage is driven by a distinct dominant RoPE frequency band, and each failure mode emerges when the dominant band of a particular stage is disrupted. Building on this insight, we propose Stage-Aware RoPE Remapping, a training-free strategy that assigns each frequency band a stage-specific remapping rule, jointly suppressing all three failure modes. We further observe that attention…
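The idea of giving each RoPE frequency band its own rule at each generation stage can be illustrated with a minimal sketch. Only the standard RoPE frequency formula is established here; the equal-width banding, the function names, and the per-band scale factors are hypothetical placeholders chosen for illustration, not the rules used in the paper.

```python
import numpy as np

def rope_frequencies(head_dim, base=10000.0):
    # Standard RoPE frequencies: theta_i = base^(-2i/d), i = 0 .. d/2 - 1.
    i = np.arange(head_dim // 2)
    return base ** (-2.0 * i / head_dim)

def band_of(freqs, n_bands=3):
    # ASSUMPTION: split frequency indices into equal contiguous bands.
    # Band 0 holds the highest frequencies, the last band the lowest.
    return np.arange(len(freqs)) * n_bands // len(freqs)

def remapped_angles(positions, freqs, bands, stage_scale):
    # stage_scale[b] compresses positions for band b at the current stage
    # (1.0 leaves the band untouched, 2.0 halves its rotation angles).
    # ASSUMPTION: simple per-band position scaling stands in for the
    # paper's remapping rules, which are not given in the abstract.
    scale = np.asarray(stage_scale, dtype=float)[bands]
    return np.outer(positions, freqs / scale)

# Hypothetical rule table for 2x extrapolation: one list of per-band
# scales for each generation stage. The values are placeholders.
HYPOTHETICAL_RULES = {
    "coarse": [1.0, 1.0, 2.0],
    "fine": [1.0, 2.0, 2.0],
}

freqs = rope_frequencies(head_dim=12)
bands = band_of(freqs, n_bands=3)
positions = np.arange(8)
angles_by_stage = {
    stage: remapped_angles(positions, freqs, bands, scales)
    for stage, scales in HYPOTHETICAL_RULES.items()
}
```

Each entry of `angles_by_stage` is a `(positions, head_dim / 2)` array of rotation angles that would feed the usual cos/sin rotation of queries and keys. Keeping the rule table separate from the angle computation makes it easy to swap in other per-stage rules.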
