FPTQuant: Function-Preserving Transforms for LLM Quantization
Boris van Breugel, Yelysei Bondarenko, Paul Whatmough, Markus Nagel

TL;DR
FPTQuant introduces novel function-preserving transforms that enable efficient INT4 quantization of large language models, maintaining high accuracy with minimal overhead and achieving state-of-the-art speed-ups.
Contribution
The paper proposes four new function-preserving transforms tailored for transformer models, facilitating effective quantization without significant performance loss.
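To make "function-preserving" concrete, here is a minimal NumPy sketch of the general idea, not the paper's specific transforms. It assumes two back-to-back linear maps with no nonlinearity between them; the shapes, the inflated column, and the diagonal scaling `t` are all hypothetical choices made for illustration. An invertible matrix is merged into the first weight and its inverse into the second, so the output is unchanged while the intermediate activation range shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy shapes: 8 tokens, hidden width 16.
x = rng.normal(size=(8, 16))
w1 = rng.normal(size=(16, 16))
w2 = rng.normal(size=(16, 16))

# Inflate one column of w1 so the intermediate activation has an outlier channel.
w1[:, 3] *= 50.0

# Reference computation and per-channel magnitude of the intermediate activation.
h = x @ w1
y_ref = h @ w2
s = np.abs(h).max(axis=0)

# Invertible per-channel scaling T = diag(1 / s) and its inverse.
# (Illustrative choice only; any invertible T preserves the function here.)
t = np.diag(1.0 / s)
t_inv = np.diag(s)

# Merge T into the first weight and T^{-1} into the second: no runtime cost.
w1_merged = w1 @ t
w2_merged = t_inv @ w2

h_new = x @ w1_merged
y_new = h_new @ w2_merged

range_before = np.abs(h).max()
range_after = np.abs(h_new).max()
print("outputs match:", np.allclose(y_ref, y_new))
print("activation range before/after:", range_before, range_after)
```

The output is identical up to floating-point error, but the tensor that would be quantized (`h_new`) no longer has a dominant channel.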
Findings
Enables static INT4 quantization with minimal overhead.
Achieves up to 3.9x speed-up over full-precision models.
Maintains or exceeds the accuracy of prior quantization methods.
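As a rough illustration of why static INT4 quantization is sensitive to outliers, the sketch below fixes a symmetric per-tensor scale from calibration data and reuses it on unseen data. The helper `quantize_int4`, the synthetic Gaussian activations, and the injected outlier value are assumptions for this example and do not come from the paper.

```python
import numpy as np

def quantize_int4(x, scale):
    """Symmetric INT4 fake-quantization with a fixed (static) scale."""
    q = np.clip(np.round(x / scale), -8, 7)  # signed 4-bit integer grid
    return q * scale                          # map back to real values

rng = np.random.default_rng(0)
calib = rng.normal(size=4096)  # synthetic calibration activations
test = rng.normal(size=4096)   # synthetic unseen activations

# Static scale: chosen once from calibration data, then frozen.
scale_clean = np.abs(calib).max() / 7.0
err_clean = np.mean((test - quantize_int4(test, scale_clean)) ** 2)

# Same data with a single large-magnitude outlier in the calibration set.
calib_outlier = calib.copy()
calib_outlier[0] = 100.0
scale_outlier = np.abs(calib_outlier).max() / 7.0
err_outlier = np.mean((test - quantize_int4(test, scale_outlier)) ** 2)

print("MSE without outlier:", err_clean)
print("MSE with outlier:   ", err_outlier)
```

A single outlier stretches the 16-level grid so far that most ordinary values round to zero, which is the failure mode that activation-shaping transforms aim to avoid.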
Abstract
Large language models (LLMs) require substantial compute, and thus energy, at inference time. While quantizing weights and activations is effective at improving efficiency, naive quantization of LLMs can significantly degrade performance due to large magnitude outliers. This paper describes FPTQuant, which introduces four novel, lightweight, and expressive function-preserving transforms (FPTs) to facilitate quantization of transformers: (1) a mergeable pre-RoPE transform for queries and keys, (2) a mergeable transform for values, (3) a mergeable scaling transform within the MLP block, and (4) a cheap, dynamic scaling transform. By leveraging the equivariances and independencies inherent to canonical transformer operation, we designed these FPTs to maintain the model's function while shaping the intermediate activation distributions to be more quantization friendly. FPTQuant requires no…
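The sketch below illustrates one way a transform applied before RoPE can leave attention scores unchanged. It relies on the fact that 2D rotations commute, so a block-diagonal matrix of scaled 2x2 rotations commutes with RoPE's per-pair rotations. Whether this matches the paper's exact parameterization is an assumption; the helper names, the single-head setup, and the interleaved pairing of dimensions (2i, 2i+1) are also illustrative choices.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotary embedding for x of shape (tokens, dim), pairing dims (2i, 2i+1)."""
    dim = x.shape[-1]
    freqs = base ** (-np.arange(0, dim, 2) / dim)  # one frequency per pair
    angles = positions[:, None] * freqs[None, :]   # (tokens, dim / 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

def scaled_rotation_blocks(scales, thetas):
    """Block-diagonal matrix with one 2x2 block scale * R(theta) per RoPE pair."""
    n = len(scales)
    t = np.zeros((2 * n, 2 * n))
    for i, (s, th) in enumerate(zip(scales, thetas)):
        c, sn = np.cos(th), np.sin(th)
        t[2 * i:2 * i + 2, 2 * i:2 * i + 2] = s * np.array([[c, -sn], [sn, c]])
    return t

rng = np.random.default_rng(0)
tokens, dim = 6, 8
positions = np.arange(tokens, dtype=float)
q = rng.normal(size=(tokens, dim))  # pre-RoPE queries (single head)
k = rng.normal(size=(tokens, dim))  # pre-RoPE keys (single head)

scales = rng.uniform(0.5, 2.0, size=dim // 2)
thetas = rng.uniform(0.0, 2 * np.pi, size=dim // 2)
t_q = scaled_rotation_blocks(scales, thetas)        # applied to queries
t_k = scaled_rotation_blocks(1.0 / scales, thetas)  # inverse transpose of t_q

scores_ref = rope(q, positions) @ rope(k, positions).T
scores_new = rope(q @ t_q, positions) @ rope(k @ t_k, positions).T
print("scores match:", np.allclose(scores_ref, scores_new))
```

Because `q` and `k` are themselves outputs of linear projections, `t_q` and `t_k` could be folded into the query and key projection weights, which is what makes such a transform mergeable.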
Peer Reviews
Decision: Submitted to ICLR 2026
1. The pre-RoPE transformation proposed by the authors elegantly solves the merging problem of RoPE in quantization, achieving "functional equivalence" through RoPE, a first among existing methods.
2. Applying RMSNorm to the residual path equivalently introduces an approximate dynamic normalization mechanism, which is beneficial to quantization stability.
3. A comprehensive comparison was conducted on models such as LLaMA 2, LLaMA 3, and 3B instruct, covering various quantization bit widths and s
1. Lack of theoretical guarantees for optimization: The rotation matrix of Pre-RoPE is obtained only through local optimization by minimizing the L4 norm; the paper does not provide convergence or global optimality analysis.
2. Lack of distribution validation for residual scaling: Although the authors claim that this mechanism can reduce the amplitude difference between tokens, the paper does not provide activation distribution or outlier visualization, making it difficult to intuitively unders
1. Minimal inference overhead with mergeable transforms. Most FPTs (e.g., pre-RoPE for queries/keys, per-head value transform) can be merged into existing model weights, avoiding extra computational cost or custom kernels during inference, which is critical for practical LLM deployment, especially on edge devices.
2. Leverages transformer equivariances/independencies and two-stage optimization (local L_p-norm minimization + end-to-end student-teacher training) to reshape activation distributions,
1. For very challenging setups (e.g., W4A4KV4 on Llama 2 7B), FPTQuant's accuracy gap with FlatQuant widens, especially in zero-shot reasoning, indicating limitations in handling severe quantization pressure.
2. While inference is lightweight, FPTQuant's two-stage optimization (local L_p-norm minimization + end-to-end student-teacher training) adds more training steps than simpler PTQ methods (e.g., RTN-opt); even with mergeable transforms, training larger models (e.g., Llama 2 7B) takes longer th
1. This article achieves static INT4 quantization through dynamic per-token scaling and a pre-RoPE transformation.
2. Detailed ablation experiments. The article ablates each type of transformation thoroughly enough to demonstrate the effectiveness of each component.
1. The novelty of some components in this article is limited. OSTQuant also uses per-head invertible matrices and supervises training with the teacher model's full probability labels.
2. This article achieves a similar inference speed to SpinQuant, but its accuracy is lower than that of SOTA methods such as FlatQuant, which limits its overall contribution.
3. The typesetting of the paper needs improvement. For example, the color scheme in Figure 1 makes it difficult to read.
4. Fig
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling
