PTQ4ARVG: Post-Training Quantization for AutoRegressive Visual Generation Models
Xuewen Liu, Zhikai Li, Jing Zhang, Mengjuan Chen, and Qingyi Gu

TL;DR
This paper introduces PTQ4ARVG, a post-training quantization framework for autoregressive visual generation models that quantizes them to 8-bit and 6-bit precision without significant performance loss.
Contribution
It proposes a training-free quantization method that addresses three key challenges in ARVG models (channel-wise outliers, token-wise variance, and sample-wise distribution mismatch) using the techniques GPS, STWQ, and DGC.
Findings
Effective 8-bit and 6-bit quantization of ARVG models.
Maintains competitive performance post-quantization.
Addresses key quantization challenges in autoregressive visual models.
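To make the 8-bit versus 6-bit setting concrete, the following is a minimal sketch of symmetric uniform quantization followed by dequantization, applied to synthetic Gaussian values. It is a generic illustration, not the PTQ4ARVG pipeline: the function names, the per-tensor max-based scale, and the random data are all assumptions made for this example.

```python
import random

def quantize_dequantize(values, bits):
    """Symmetric uniform fake quantization (hypothetical helper, not from the paper)."""
    qmax = 2 ** (bits - 1) - 1                  # 127 for 8 bits, 31 for 6 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    out = []
    for v in values:
        q = round(v / scale)                    # map to the integer grid
        q = max(-qmax, min(qmax, q))            # clamp to the representable range
        out.append(q * scale)                   # map back to floating point
    return out

def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
# Synthetic stand-in for a weight tensor; real ARVG weights are not used here.
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]

mse_8bit = mean_squared_error(weights, quantize_dequantize(weights, 8))
mse_6bit = mean_squared_error(weights, quantize_dequantize(weights, 6))
print(f"8-bit MSE: {mse_8bit:.3e}, 6-bit MSE: {mse_6bit:.3e}")
```

The 6-bit grid has about four times fewer levels per unit range than the 8-bit grid, so its rounding error is larger, which is why lower bit-widths are the harder setting.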
Abstract
AutoRegressive Visual Generation (ARVG) models retain an architecture compatible with language models, while achieving performance comparable to diffusion-based models. Quantization is commonly employed in neural networks to reduce model size and computational latency. However, applying quantization to ARVG remains largely underexplored, and existing quantization methods fail to generalize effectively to ARVG models. In this paper, we explore this issue and identify three key challenges: (1) severe outliers at channel-wise level, (2) highly dynamic activations at token-wise level, and (3) mismatched distribution information at sample-wise level. To these ends, we propose PTQ4ARVG, a training-free post-training quantization (PTQ) framework consisting of: (1) Gain-Projected Scaling (GPS) mitigates the channel-wise outliers, which expands the quantization loss via a Taylor series to…
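The channel-wise outlier problem, and the general idea of moving scale between activations and weights, can be sketched as follows. This is not GPS: the per-channel scale below is a naive max-based choice assumed purely for illustration, whereas the abstract describes GPS as deriving its scaling from a Taylor expansion of the quantization loss. The matrix sizes, the outlier channel index, and all helper names are likewise hypothetical.

```python
import random

def fake_quant(matrix, bits):
    """Per-tensor symmetric fake quantization of a list-of-lists matrix."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for row in matrix for v in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [[max(-qmax, min(qmax, round(v / scale))) * scale for v in row]
            for row in matrix]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def mse(a, b):
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)

random.seed(0)
tokens, channels, out_features = 16, 8, 4
X = [[random.gauss(0.0, 1.0) for _ in range(channels)] for _ in range(tokens)]
for row in X:
    row[3] *= 50.0                              # inject one outlier channel
W = [[random.gauss(0.0, 1.0) for _ in range(out_features)] for _ in range(channels)]

# Naive per-channel scale (assumption): the channel's largest absolute activation.
s = [max(abs(row[c]) for row in X) for c in range(channels)]
X_scaled = [[row[c] / s[c] for c in range(channels)] for row in X]
W_scaled = [[W[c][j] * s[c] for j in range(out_features)] for c in range(channels)]

# In full precision the transform is exact: X @ W == (X / s) @ (s * W).
reference = matmul(X, W)
rescaled = matmul(X_scaled, W_scaled)
equivalence_gap = max(abs(x - y) for ra, rb in zip(reference, rescaled)
                      for x, y in zip(ra, rb))

# 6-bit activation error, measured in the original activation domain.
X_hat_plain = fake_quant(X, 6)
X_hat_scaled = [[v * s[c] for c, v in enumerate(row)]
                for row in fake_quant(X_scaled, 6)]
mse_plain = mse(X, X_hat_plain)
mse_scaled = mse(X, X_hat_scaled)
print(f"gap={equivalence_gap:.2e}, plain={mse_plain:.3e}, scaled={mse_scaled:.3e}")
```

Without scaling, the outlier channel sets the quantization range for the whole tensor and the ordinary channels lose most of their resolution. Scaling restores it, but the larger per-channel factors now sit in the weights, so the choice of scale is a trade-off between activation and weight quantization error.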
Peer Reviews
Decision: ICLR 2026 Poster
* The proposed Gain-Projected Scaling (GPS) offers a principled mechanism for balancing weight and activation scaling by leveraging Hessian information rather than relying on empirical heuristics. This theoretically grounded design suggests strong potential for broader applicability beyond ARVG models.
* Experimental results consistently show that the proposed method achieves significant improvements over baseline quantization approaches, underscoring its effectiveness and practical value.
* Achieving the optimal GPS configuration appears challenging compared to empirical estimation. The approach involves several approximations, including those for the Hessian matrix, the upper bound of overall quantization loss, and scaling quantization error. These approximations may introduce discrepancies across different input distributions and norms. Moreover, the final GPS algorithm computes scaling based only on the most significant channel, which may not reflect a global—or even local—opt
1. Problem formulation is clear and well targeted. The authors carefully analyze how ARVG differs from standard LLMs and diffusion models in activation distributions and token structure. The three proposed challenges at the channel, token, and sample levels capture the essential difficulties of quantizing ARVG, providing a solid basis for method design.
2. Method design has theoretical support. The GPS component is not purely heuristic; it derives an analytic expression for the effect of the sca
1. Several important approximations and assumptions in GPS are not sufficiently validated. GPS omits Hessian cross terms in its derivation; however, a prior series of quantization works (OBD[1], OBS[2], OBC[3], GPTQ[4]) has pointed out that such omissions can introduce significant errors.
2. The position-invariance assumption underlying STWQ is limited. The authors rely on “position-invariant distributions” as the core justification for STWQ, but the paper only shows statistics for a few layers an
1. High Relevance and Motivation: The work addresses a critical bottleneck (quantization) for deploying large generative models, filling a significant research gap within the emerging ARVG model class.
2. Clarity and Technical Detail: The paper clearly articulates the unique quantization challenges specific to ARVG and proposes technically sound solutions, such as GPS, which optimizes the scaling factor using a closed-form solution derived from a Taylor series expansion.
3. Extensive Robustness
The generative evaluation is confined to the ImageNet dataset and relies heavily on traditional metrics such as FID/IS, which correlate poorly with human perception. Including modern perceptual metrics (e.g., HPS [1] or CLIP Score) on diverse, high-fidelity datasets is strongly recommended for a more robust comparison.
[1] Ma Y, Wu X, Sun K, et al. Hpsv3: Towards wide-spectrum human preference score. CVPR 2025: 15086-15095.
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis
