SSG: Scaled Spatial Guidance for Multi-Scale Visual Autoregressive Generation
Youngwoo Shin, Jiwan Hur, Junmo Kim

TL;DR
This paper introduces Scaled Spatial Guidance (SSG), a training-free inference technique that enhances multi-scale visual autoregressive models by emphasizing high-frequency details, improving image fidelity and diversity without increasing latency.
Contribution
The paper proposes SSG, a novel inference-time guidance method that maintains the coarse-to-fine hierarchy in VAR models by isolating and emphasizing high-frequency signals, improving image quality.
Findings
SSG improves image fidelity and diversity in VAR models.
SSG maintains low latency during inference.
The method is broadly applicable across different tokenization and conditioning modalities.
Abstract
Visual autoregressive (VAR) models generate images through next-scale prediction, naturally achieving coarse-to-fine, fast, high-fidelity synthesis mirroring human perception. In practice, this hierarchy can drift at inference time, as limited capacity and accumulated error cause the model to deviate from its coarse-to-fine nature. We revisit this limitation from an information-theoretic perspective and deduce that ensuring each scale contributes high-frequency content not explained by earlier scales mitigates the train-inference discrepancy. With this insight, we propose Scaled Spatial Guidance (SSG), training-free, inference-time guidance that steers generation toward the intended hierarchy while maintaining global coherence. SSG emphasizes target high-frequency signals, defined as the semantic residual, isolated from a coarser prior. To obtain this prior, we leverage a principled…
Peer Reviews
Decision·ICLR 2026 Poster
- The paper identifies that the generation order of VAR models implicitly forms a scale-guidance structure and introduces a training-free guidance method that exploits this property.
- The proposed approach is grounded in the information bottleneck perspective, and the authors empirically support that the method can correct distorted signals.
- Extensive experiments across various models and datasets demonstrate consistent improvements, highlighting the robustness and generality of the method.
- While the method appears generic, the presentation is heavily specialized for VAR, making the broader applicability unclear.
- The approach, although training-free, feels ad-hoc and may not fundamentally address scalability and representation issues in visual tokenization; it would strengthen the contribution to explore how SSG could be incorporated into tokenization or model design directly.
- The abstract emphasizes improving high-frequency details, but the validation for this claim is limit
1. SSG works purely at inference on logits, requiring no retraining, fine-tuning, or architectural changes.
2. The method is derived from an information bottleneck perspective, offering a principled justification for why emphasizing high-frequency residuals improves coarse-to-fine generation.
3. The frequency-domain DSE module is simple and elegant.
4. Performance is good. SSG consistently improves VAR across various model sizes and input resolutions, with only negligible cost.
1. The proposed method is closely tied to the coarse-to-fine next-scale structure of VAR models. While this focus is well-motivated, it somewhat limits the generality of the contribution. It remains unclear whether SSG can extend to broader autoregressive or diffusion-based generation frameworks. Discussing how the information-theoretic insights or the frequency-domain prior could inspire guidance mechanisms beyond VAR would strengthen the paper’s broader impact.
2. Most examples highlight succ
SSG is a one-line logit update with a clear algorithmic recipe (Alg. 1–2) and no retraining or architectural changes; it operates directly on residual logits and drops into a wide range of VAR-style decoders.
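To make the "one-line logit update" concrete, here is a minimal NumPy sketch of a CFG-style extrapolation on residual logits. The exact SSG formula is not given in this summary, so everything below is an assumption for illustration: the function names `upsample_nearest` and `guided_logits`, the guidance weight `beta`, and the use of nearest-neighbour upsampling as a stand-in for the paper's frequency-domain prior construction.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of an (H, W, V) logit map (hypothetical helper)."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def guided_logits(logits, coarse_prior, beta):
    """Assumed update: push logits away from a coarser prior.

    residual = logits - coarse_prior is the part of the current scale
    that the prior does not explain; it is amplified by beta.
    """
    residual = logits - coarse_prior
    return logits + beta * residual

rng = np.random.default_rng(0)
coarse = rng.normal(size=(2, 2, 5))  # hypothetical logits from an earlier, coarser scale
fine = rng.normal(size=(4, 4, 5))    # hypothetical logits at the current scale
prior = upsample_nearest(coarse, 2)  # stand-in prior at the current resolution
out = guided_logits(fine, prior, beta=0.5)
```

With `beta = 0` the update returns the original logits unchanged, which is why a rule of this shape adds no retraining or architectural changes; whether the paper's Alg. 1–2 match this form should be checked against the paper itself.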
SSG feels close to existing guidance/contrastive logit-shaping (e.g., CFG-like sharpening, residual boosting), with limited conceptual leap. The motivation for DSE/SSG leans on orthonormal transforms yielding “independent and non-interfering” bands, enabling “exact, lossless reconstruction” and a clean separation of low/high frequencies for prior construction. That’s a strong assumption in learned logit spaces and likely violated by aliasing, context coupling, and tokenization quirks; the paper
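The property the reviewer refers to, that an orthonormal transform splits a signal into non-interfering bands that sum back exactly, can be checked on a plain array. The sketch below uses an orthonormal DCT-II built by hand in NumPy; the choice of DCT, the square low-pass mask, and the names `dct_matrix` and `split_bands` are assumptions for illustration, not the paper's DSE module. It demonstrates only the linear algebra and says nothing about whether learned logit spaces satisfy the assumption.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix C of shape (n, n), so that C @ C.T == I."""
    k = np.arange(n)[:, None]  # frequency index
    i = np.arange(n)[None, :]  # sample index
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)  # DC row uses a different normalisation
    return c

def split_bands(x, cutoff):
    """Split a square array into low- and high-frequency parts (hypothetical helper)."""
    n = x.shape[0]
    c = dct_matrix(n)
    coeffs = c @ x @ c.T                      # forward 2D transform
    mask = np.zeros((n, n))
    mask[:cutoff, :cutoff] = 1.0              # keep the lowest `cutoff` frequencies per axis
    low = c.T @ (coeffs * mask) @ c           # inverse transform of the low band
    high = c.T @ (coeffs * (1.0 - mask)) @ c  # inverse transform of the remainder
    return low, high

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))
low, high = split_bands(x, cutoff=4)
```

Here `low + high` recovers `x` up to floating-point error, and the two bands are orthogonal under the elementwise inner product because the coefficient masks are disjoint.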
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Image Enhancement Techniques · Domain Adaptation and Few-Shot Learning
