Learning under Quantization for High-Dimensional Linear Regression
Dechen Zhang, Junwei Su, Difan Zou

TL;DR
This paper provides the first rigorous theoretical analysis of how various quantization schemes impact high-dimensional linear regression training, revealing their effects on noise, data spectrum, and learning risk.
Contribution
It introduces a novel analytical framework that characterizes the influence of different quantization types on learning performance in linear regression.
Findings
Additive quantization suppresses noise amplification scaled by batch size.
Multiplicative quantization preserves spectral structure, reducing distortion.
Risks are compared quantitatively under polynomial-decay data spectra.
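The additive versus multiplicative distinction in the findings above can be illustrated with a minimal sketch. It assumes, as a reading not spelled out in this summary, that "additive" means a quantization error whose scale does not depend on the input (a uniform, integer-style grid) and that "multiplicative" means an error proportional to the input's magnitude (a floating-point-style grid). Both quantizers use unbiased stochastic rounding; the function names and the `step` and `mantissa_bits` parameters are hypothetical and not taken from the paper.

```python
import numpy as np

def additive_quantize(x, step, rng):
    """Stochastic rounding onto a uniform grid with fixed spacing `step`.

    The error magnitude is below `step` regardless of how large x is.
    """
    x = np.asarray(x, dtype=float)
    scaled = x / step
    low = np.floor(scaled)
    # Round up with probability equal to the fractional part, so E[Q(x)] = x.
    return step * (low + (rng.random(x.shape) < scaled - low))

def multiplicative_quantize(x, mantissa_bits, rng):
    """Stochastic rounding onto a grid whose spacing scales with |x|.

    Assumed FP-style model: spacing is 2**(floor(log2|x|) - mantissa_bits),
    so the error magnitude is at most |x| * 2**(-mantissa_bits).
    """
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    nz = x != 0  # zero is represented exactly
    exponent = np.floor(np.log2(np.abs(x[nz])))
    step = 2.0 ** (exponent - mantissa_bits)
    scaled = x[nz] / step
    low = np.floor(scaled)
    out[nz] = step * (low + (rng.random(scaled.shape) < scaled - low))
    return out

rng = np.random.default_rng(0)
values = np.array([0.0, 0.3, -1.7, 250.0])
print(additive_quantize(values, 0.5, rng))
print(multiplicative_quantize(values, 3, rng))
```

In this sketch the additive quantizer loses small entries in a fixed-size error, while the multiplicative one keeps the relative error of every entry bounded.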
Abstract
The use of low-bit quantization has emerged as an indispensable technique for enabling the efficient training of large-scale models. Despite its widespread empirical success, a rigorous theoretical understanding of its impact on learning performance remains notably absent, even in the simplest linear regression setting. We present the first systematic theoretical study of this fundamental question, analyzing finite-step stochastic gradient descent (SGD) for high-dimensional linear regression under a comprehensive range of quantization targets: data, label, parameter, activation, and gradient. Our novel analytical framework establishes precise algorithm-dependent and data-dependent excess risk bounds that characterize how different quantization affects learning: parameter, activation, and gradient quantization amplify noise during training; data quantization distorts the data spectrum…
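The setting described in the abstract can be sketched as a small simulation. This is an illustration under stated assumptions, not the paper's experimental setup: it uses synthetic Gaussian features with a polynomial-decay covariance spectrum, one-pass SGD with batch size 1 and a constant learning rate, and parameter quantization by unbiased stochastic rounding on a uniform grid. The names `run_sgd` and `stochastic_round`, the ground-truth vector, and all default values are hypothetical.

```python
import numpy as np

def stochastic_round(x, step, rng):
    """Unbiased stochastic rounding onto a uniform grid with spacing `step`."""
    scaled = x / step
    low = np.floor(scaled)
    return step * (low + (rng.random(x.shape) < scaled - low))

def run_sgd(n_steps, dim, lr, step=None, seed=0, decay=2.0, noise_std=0.1):
    """One-pass SGD on linear regression; returns the final excess risk.

    If `step` is None, parameters stay in full precision; otherwise they are
    stochastically rounded after every update (parameter quantization).
    """
    rng = np.random.default_rng(seed)
    # Assumed covariance spectrum: lambda_k = k**(-decay), diagonal covariance.
    eigvals = np.arange(1, dim + 1, dtype=float) ** (-decay)
    w_star = np.ones(dim)  # hypothetical ground-truth parameter
    w = np.zeros(dim)
    for _ in range(n_steps):
        x = rng.standard_normal(dim) * np.sqrt(eigvals)
        y = x @ w_star + noise_std * rng.standard_normal()
        w = w - lr * (x @ w - y) * x  # gradient of 0.5 * (x @ w - y)**2
        if step is not None:
            w = stochastic_round(w, step, rng)
    diff = w - w_star
    # Excess risk 0.5 * (w - w*)^T H (w - w*) with H = diag(eigvals).
    return 0.5 * np.sum(eigvals * diff ** 2)

full_precision = run_sgd(n_steps=2000, dim=50, lr=0.1)
quantized = run_sgd(n_steps=2000, dim=50, lr=0.1, step=0.25)
print(f"full precision: {full_precision:.4f}, quantized params: {quantized:.4f}")
```

Varying `step`, `decay`, and `n_steps` gives a rough empirical feel for how parameter quantization noise interacts with the data spectrum; the other quantization targets named in the abstract (data, label, activation, gradient) would need the rounding applied at different points in the loop.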
Peer Reviews
Decision: ICLR 2026 Poster
The authors position this work in the context of LLMs and low-precision training, precision-scaling laws, and benign overfitting and overparameterization theory, which is highly relevant in today's ML landscape where bigger models and more data are preferred for better performance. The paper tackles a fundamental question: How does quantization affect the generalization performance of SGD for linear regression, in contrast to prior works that focus on convergence of optimization algorithms
Firstly, I have a semi-major concern: The paper implicitly assumes quantization is purely detrimental and focuses on bounded degradation relative to full precision. But quantization introduces stochasticity that can also act as implicit regularization, much like SGD noise, weight decay, etc. (Ref: https://arxiv.org/abs/2101.12176). There are also some empirical works in deep learning where low precision improves generalization slightly (e.g., https://arxiv.org/abs/2206.12372). The
1. The authors provide a unified framework that decomposes the excess risk of quantized SGD into interpretable components.
2. The paper is mathematically solid and gives explicit excess risk bounds with a clear decomposition and scaling behavior under both additive and multiplicative quantization.
3. The authors provide interesting insights into the trade-off between precision and generalization, and also provide a theoretical link between quantization type, like FP vs. INT, can be beneficial for scali
1. The experimental section uses only synthetic Gaussian data; it would be more convincing if validated on real-world datasets.
2. The analysis relies on idealized assumptions, such as unbiased stochastic quantization, but these may not hold in practical low-precision systems. Discussion or relaxation of these assumptions would strengthen the generality.
3. While motivated by scaling-law literature, the link between derived quantization effects and empirical scaling behaviors remains largely qua
* The paper brings a theoretical analysis of different quantization targets.
* The definitions, assumptions, and notation are easy to follow.
* The statement of Theorem 4.1 is difficult to analyze, and it is hard to derive guidance from it on what to quantize to maximize performance within a limited resource budget; the paper would benefit from organizing and grouping the results and from plugging in an optimal learning rate.
* The bounds in Theorem 4.1 do not improve with the number of samples $N$, which is atypical for SGD analyses (even in more general convex or non-convex settings). This issue remains even without compression (i.e., when $\varepsilon
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Data Compression Techniques · Advanced Neural Network Applications
