LSRS: Latent Scale Rejection Sampling for Visual Autoregressive Modeling
Hong-Kai Zheng, Piji Li

TL;DR
This paper introduces LSRS, a method that refines latent token maps during inference in visual autoregressive models, significantly improving image generation quality with minimal extra computation.
Contribution
We propose Latent Scale Rejection Sampling (LSRS), a novel inference technique that enhances VAR models by refining token maps to reduce structural errors.
Findings
LSRS reduces the FID score from 1.95 to 1.78 with only a 1% increase in inference time.
With a 15% increase in inference time, LSRS further reduces the FID score to 1.66.
LSRS effectively mitigates autoregressive errors while maintaining computational efficiency.
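To put the reported numbers side by side, the relative FID reduction at each inference-time overhead can be worked out directly. This is a small arithmetic sketch using only the figures quoted above; the function name `relative_reduction` is illustrative and not from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of a lower-is-better metric, in percent."""
    return (baseline - improved) / baseline * 100.0

BASELINE_FID = 1.95  # FID without LSRS, as reported in the findings

# (reported FID with LSRS, reported increase in inference time in percent)
settings = [(1.78, 1), (1.66, 15)]

for fid, overhead in settings:
    gain = relative_reduction(BASELINE_FID, fid)
    print(f"+{overhead}% time: FID {BASELINE_FID} -> {fid} ({gain:.1f}% lower)")
```

Under these figures the 1% setting yields roughly an 8.7% relative reduction in FID, and the 15% setting roughly 14.9%.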
Abstract
The Visual Autoregressive (VAR) modeling approach for image generation performs autoregressive processing across hierarchical scales, decoding multiple tokens per scale in parallel. This method achieves high-quality generation while accelerating synthesis. However, parallel token sampling within a scale may lead to structural errors, resulting in suboptimal generated images. To mitigate this, we propose Latent Scale Rejection Sampling (LSRS), a method that progressively refines token maps in the latent scale during inference to enhance VAR models. Our method uses a lightweight scoring model to evaluate multiple candidate token maps sampled at each scale, selecting a high-quality map to guide subsequent scale generation. By prioritizing early scales critical for structural coherence, LSRS effectively mitigates autoregressive error accumulation while maintaining computational efficiency.…
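The selection loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `toy_next_scale_logits` and `toy_score` are hypothetical stand-ins for the VAR transformer and the lightweight scoring model, the codebook size and scale sizes are made up, and selection is assumed to be a simple best-of-N pick. A real VAR model also conditions on accumulated residual feature maps, which is omitted here.

```python
import numpy as np

VOCAB = 16  # toy codebook size (assumption, not from the paper)

def toy_next_scale_logits(prefix, side):
    """Hypothetical stand-in for the VAR transformer: logits for one scale."""
    seed = sum(int(m.sum()) for m in prefix) + side
    return np.random.default_rng(seed).normal(size=(side, side, VOCAB))

def toy_score(prefix, candidate):
    """Hypothetical stand-in for the scoring model: rewards smooth token maps."""
    dx = np.abs(np.diff(candidate, axis=0)).sum()
    dy = np.abs(np.diff(candidate, axis=1)).sum()
    return -float(dx + dy)

def sample_token_map(logits, rng):
    """Sample every token of a scale independently, as in parallel decoding."""
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    flat = probs.reshape(-1, probs.shape[-1])
    tokens = [rng.choice(flat.shape[1], p=p) for p in flat]
    return np.array(tokens).reshape(logits.shape[:2])

def generate(sides, num_candidates, refined_scales, rng,
             logits_fn=toy_next_scale_logits, score_fn=toy_score):
    """Generate one token map per scale, refining only the earliest scales."""
    prefix = []
    for k, side in enumerate(sides):
        logits = logits_fn(prefix, side)
        # Draw several candidates only for the early, structure-defining scales.
        n = num_candidates if k < refined_scales else 1
        candidates = [sample_token_map(logits, rng) for _ in range(n)]
        if n == 1:
            best = candidates[0]
        else:
            scores = [score_fn(prefix, c) for c in candidates]
            best = candidates[int(np.argmax(scores))]
        prefix.append(best)  # the selected map conditions later scales
    return prefix

maps = generate(sides=[1, 2, 3, 4], num_candidates=4, refined_scales=2,
                rng=np.random.default_rng(0))
print([m.shape for m in maps])
```

Restricting the extra candidates to the first few scales keeps the added cost small in this sketch, because early token maps contain few tokens and only the scoring model runs once per extra candidate.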
Peer Reviews
Decision·Submitted to ICLR 2026
1. The paper is well-written and easy to follow. 2. The experiments and ablation studies are detailed and comprehensive.
1. The paper does not discuss the generality of the proposed method. Since all experiments are conducted on ImageNet, the scoring model is also trained on ImageNet, which is in-domain data. Can a scoring model trained on ImageNet be effectively applied to other domains, such as text-to-image models? Or, to adapt this method for text-to-image generation, would it be necessary to retrain the scoring model on domain-specific data? 2. The structural error problem in parallel decoding is not first id
1. The paper identifies a key limitation of current VAR models: sampling many tokens in parallel within a scale in a single step may degrade generation quality. 2. LSRS introduces rejection sampling in the latent space of multi-scale autoregressive models; it is simple and lightweight, making it highly practical for deployment. 3. The proposed LSRS method improves generation quality over VAR, reducing its FID score from 1.95 to 1.78 while increasing the infe
1. The issue addressed in this paper is that parallel token sampling within a scale may lead to structural errors, resulting in suboptimal generated images. To a certain extent, LSRS reduces the structural errors in the generation process, but it does not directly address the root cause of the problem: sampling many tokens in parallel. Instead, it merely selects the best result from the candidate pool. 2. Figure 1 does not fully illustrate the issue that parallel token sampling within a scale may lead to str
- The analysis of VAR’s mechanisms and inherent limitations provides valuable insights, revealing that earlier scales play a more critical role in determining overall image structure. - The proposed latent scale rejection sampling (LSRS) method is technically well-founded, combining real and synthetic data construction, scoring model training, and token map selection guided by the scoring model. - Extensive experiments on the ImageNet image generation task demonstrate the effectiveness of LSRS a
- The relationship between the imperfect parallel sampling mechanism and the proposed LSRS sampling method is not clearly explained. Since the base VAR models remain unchanged and LSRS just runs the base models several times for certain latent scales, it is unclear how LSRS effectively mitigates the limitations of the mutually independent token sampling mechanism. - While the ImageNet experiments are sufficient to demonstrate the effectiveness of the proposed method, the paper would be further s
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Image Enhancement Techniques · Computer Graphics and Visualization Techniques
