TL;DR
STaMP introduces a novel sequence transformation and mixed precision approach that enhances low-bit activation quantization in language and visual models, maintaining accuracy while reducing computational resources.
Contribution
The paper proposes applying linear transformations along the sequence dimension, combined with per-token mixed precision, to improve low-bit activation quantization.
Findings
Significantly improves low-bit activation quantization accuracy.
Maintains model performance with reduced precision.
Complementary to existing quantization methods.
Abstract
Quantization is the key method for reducing inference latency, power and memory footprint of generative AI models. However, accuracy often degrades sharply when activations are quantized below eight bits. Recent work suggests that invertible linear transformations (e.g. rotations) can aid quantization by reparameterizing feature channels and weights. In this paper, we propose Sequence Transformation and Mixed Precision (STaMP) quantization, a novel strategy that applies linear transformations along the sequence dimension to exploit the strong local correlation in language and visual data. By keeping a small number of tokens in each intermediate activation at higher precision, we can maintain model accuracy at lower (average) activation bit-widths. We evaluate STaMP on recent LVM and LLM architectures, demonstrating that it significantly improves low bit width…
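The idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes an orthonormal DCT-II as the sequence transform (the reviews mention DWT/DCT), symmetric per-token absmax quantization, and a synthetic random-walk activation standing in for locally correlated data. The function names and the parameters `low_bits`, `high_bits`, and `n_high` are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of shape (n, n); row k is the k-th frequency."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (i + 0.5) * k / n)
    mat[0] /= np.sqrt(2.0)  # rescale the DC row so that mat @ mat.T == I
    return mat

def quantize_per_token(x, bits):
    """Symmetric absmax fake-quantization with one bit-width per token (row)."""
    bits = np.asarray(bits)
    qmax = (2.0 ** (bits - 1) - 1)[:, None]
    absmax = np.abs(x).max(axis=1, keepdims=True)
    scale = np.where(absmax > 0, absmax / qmax, 1.0)
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def stamp_like_quantize(x, low_bits=4, high_bits=8, n_high=4):
    """Transform along the sequence axis, quantize with mixed precision, invert.

    Assumption: the first n_high transformed tokens (lowest frequencies)
    carry most of the energy, so they are the ones kept at high_bits.
    """
    seq_len = x.shape[0]
    seq_transform = dct_matrix(seq_len)
    z = seq_transform @ x                 # mixes tokens, not feature channels
    bits = np.full(seq_len, low_bits)
    bits[:n_high] = high_bits
    z_q = quantize_per_token(z, bits)
    return seq_transform.T @ z_q          # orthonormal, so inverse = transpose

# Synthetic activation (seq_len=64, features=32): a random walk along the
# sequence axis, used here only as a stand-in for locally correlated tokens.
rng = np.random.default_rng(0)
x = np.cumsum(rng.standard_normal((64, 32)), axis=0)

baseline = quantize_per_token(x, np.full(64, 4))   # plain 4-bit per token
transformed = stamp_like_quantize(x)               # 4.25 bits on average

base_err = np.sum((x - baseline) ** 2)
stamp_err = np.sum((x - transformed) ** 2)
print(f"4-bit baseline error:      {base_err:.2f}")
print(f"transform + mixed 4/8-bit: {stamp_err:.2f}")
```

Because the transform multiplies the activation from the left, it commutes with a linear layer's weights, `(L @ X) @ W == L @ (X @ W)`, so in principle the inverse can be applied after the matrix multiplication rather than before it. The sketch ignores the runtime cost of the transform and of mixed-precision kernels, which the reviews flag as an open question.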
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper is very well written across its figures, tables, equations, and text. 2. The paper contains experiments on both SD and LLM. 3. The method is strongly complementary to, and compatible with, existing techniques.
1. I am confused by the argument in QuaRot that rotation only occurs in the feature dimension, because some weights (such as up/gate projection) are left-multiplied by the rotation matrix. 2. The experiments on Large Language Models are insufficient: (1) The model size is relatively small and has not been scaled beyond 8B parameters; (2) There are only PPL experiments, with no mainstream zero-shot reasoning experiments, which are more crucial for verifying the effe
1. New Perspective: The paper introduces quantization transformations along the sequence dimension—a novel and orthogonal approach compared to existing methods. This direction effectively exploits temporal or spatial token correlations often overlooked by prior research. 2. Solid Theoretical Analysis: The work offers a well-defined mathematical treatment of quantization error and establishes a theoretical upper bound (Theorem 1) that connects token energy, bit allocation, and resulting error.
1. Missing Real-Time and Latency Evaluation: The paper does not report inference speed, latency, or hardware efficiency metrics—critical aspects for assessing quantization performance. Given that the multiplication of XL occurs online, it would be valuable to analyze its impact during inference. 2. Absence of Calibration Set Ablation: There is no investigation into how the size of the calibration dataset influences performance, which is an essential factor for post-training quantization (PTQ) m
1. High Novelty: The primary strength of this work lies in its high degree of originality. By shifting the focus of quantization transforms from the feature dimension to the sequence dimension, the authors introduce a new research paradigm. 2. Solid Experimental Results: The paper's claims are supported by consistent and significant performance improvements across a range of models (LLMs and LVMs), datasets, and strong baselines. 3. Excellent Complementarity: The experiments clearly demonstrat
1. Unaddressed Implementation Overhead: The most significant weakness is the failure to discuss or quantify the runtime overhead of the per-token mixed-precision scheme on current hardware. The lack of native hardware support may necessitate inefficient dequantization operations, which would severely impact the method's practical speedup. 2. Unquantified Transform Cost: While the computational cost of the DWT/DCT transform is theoretically small, the paper fails to provide a simple quantitative
