XFP: Quality-Targeted Adaptive Codebook Quantization with Sparse Outlier Separation for LLM Inference
Thomas Witt

TL;DR
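XFP is an adaptive weight quantizer for large language model inference. The operator sets reconstruction quality floors, and XFP selects codebook size and outlier budget per layer without calibration data. On Qwen3.5-122B-A10B it reports 138 tok/s single-stream decode at 94.49% GSM8K strict-match accuracy.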
Contribution
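XFP introduces a quality-targeted quantization method: the operator specifies floors on per-channel cosine similarity, and the quantizer determines codebook size, outlier budget, and packing per layer without manual bit-width selection, Hessian information, or calibration data.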
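The quality-targeted idea can be illustrated with a minimal sketch. It is not XFP's actual frontend: the function names (`lloyd_1d`, `pick_bits`), the candidate bit-widths, the quantile initialization, and the floor values are all assumptions made for illustration. It runs a plain 1-D Lloyd quantizer on one weight channel and returns the smallest codebook size whose reconstruction meets a cosine-similarity floor.

```python
import numpy as np

def lloyd_1d(x, k, iters=20):
    """Plain 1-D Lloyd (k-means) quantizer; returns (codebook, indices)."""
    # Assumed initialization: centroids at evenly spaced quantiles of the data.
    codebook = np.quantile(x, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        # Assign each weight to its nearest centroid.
        idx = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)
        # Move each non-empty centroid to the mean of its members.
        for j in range(k):
            members = x[idx == j]
            if members.size:
                codebook[j] = members.mean()
    idx = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook, idx

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def pick_bits(channel, floor, candidates=(2, 3, 4, 5, 6, 8)):
    """Return (bits, cosine) for the smallest candidate meeting the floor."""
    for bits in candidates:
        codebook, idx = lloyd_1d(channel, 2 ** bits)
        cos = cosine(channel, codebook[idx])
        if cos >= floor:
            return bits, cos
    # Floor not reached by any candidate: report the largest one tried.
    return candidates[-1], cos

rng = np.random.default_rng(0)
channel = rng.standard_normal(512)
print(pick_bits(channel, floor=0.999))  # a strict floor (hypothetical value)
print(pick_bits(channel, floor=0.98))   # a lazy floor (hypothetical value)
```

A lower floor can never require more bits than a higher one in this sketch, which is the sense in which a lazy floor for routed experts would trade accuracy for a smaller footprint.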
Findings
XFP achieves 138 tok/s single-stream decode on Qwen3.5-122B-A10B under V2 (RTX PRO 6000 Blackwell, TP=2).
In the same configuration it maintains 94.49% GSM8K strict-match accuracy (3 seeds, n=3957).
The H-Process enables fitting large models into memory while preserving accuracy.
Abstract
We introduce XFP, a dynamic weight quantizer for LLM inference that inverts the conventional workflow: the operator specifies reconstruction quality floors on per-channel cosine similarity (one strict floor for attention and shared experts, one lazy floor for routed-expert MoE); XFP determines codebook size, outlier budget, and packing per layer automatically -- no Hessian, no calibration data, no manual bit-width selection. Each weight matrix is decomposed into a sparse fp16 outlier residual and a dense sub-byte index tensor into a per-group learned codebook. Two storage modes share one auto-select frontend and one fused decode kernel: V2 (per-channel Lloyd) and V2a (shared library of L=32 codebooks per layer). On Qwen3.5-122B-A10B under V2, XFP reaches 138 tok/s single-stream decode on workstation hardware (RTX PRO 6000 Blackwell, TP=2) at 94.49% GSM8K strict-match (3 seeds, n=3957),…
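The decomposition described in the abstract, a sparse fp16 outlier part plus a dense sub-byte index tensor into a codebook, can be sketched as follows. This is an illustration under stated assumptions, not XFP's kernel or storage format: the outlier rule (top fraction by magnitude), storing outlier values directly, the 4-bit nibble packing, and the quantile-based stand-in for a learned codebook are all hypothetical choices, as are the function names.

```python
import numpy as np

def split_outliers(w, frac=0.01):
    """Move the largest-magnitude entries into a sparse fp16 side structure."""
    k = max(1, int(round(frac * w.size)))
    pos = np.argpartition(np.abs(w), -k)[-k:]  # positions of the k largest |w|
    vals = w[pos].astype(np.float16)           # sparse fp16 outlier values
    dense = w.copy()
    dense[pos] = 0.0                           # dense part no longer holds them
    return dense, pos, vals

def pack_nibbles(idx):
    """Pack 4-bit indices two per byte (even length assumed)."""
    idx = idx.astype(np.uint8)
    return (idx[0::2] | (idx[1::2] << 4)).astype(np.uint8)

def unpack_nibbles(packed):
    """Inverse of pack_nibbles."""
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = packed >> 4
    return out

def decode(packed, codebook, pos, vals):
    """Codebook lookup, then overwrite outlier positions with the fp16 values."""
    w = codebook[unpack_nibbles(packed)].astype(np.float32)
    w[pos] = vals.astype(np.float32)
    return w

rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)
w[[3, 100]] = [25.0, -30.0]                    # two injected outliers

dense, pos, vals = split_outliers(w, frac=2 / 256)
# Stand-in for a learned codebook: 16 levels at quantiles of the dense part.
codebook = np.quantile(dense, (np.arange(16) + 0.5) / 16).astype(np.float32)
idx = np.abs(dense[:, None] - codebook[None, :]).argmin(axis=1)
packed = pack_nibbles(idx)

w_hat = decode(packed, codebook, pos, vals)
cos = float(w @ w_hat / (np.linalg.norm(w) * np.linalg.norm(w_hat)))
print(packed.nbytes, "bytes of indices; cosine similarity:", cos)
```

Pulling the outliers out first keeps the codebook range narrow, so the 16 levels are spent on the bulk of the distribution rather than on a few extreme values.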
