XFP: Quality-Targeted Adaptive Codebook Quantization with Sparse Outlier Separation for LLM Inference

Thomas Witt

arXiv:2605.14844 · cs.LG · May 15, 2026

TL;DR

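XFP is an adaptive weight quantizer for large language model inference. The operator sets reconstruction quality floors, and XFP selects codebook size, outlier budget, and packing per layer without calibration data. Under its V2 mode on Qwen3.5-122B-A10B it reaches 138 tok/s single-stream decode at 94.49% GSM8K strict-match.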

Contribution

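It introduces a dynamic, quality-targeted quantization method: the operator specifies floors on per-channel cosine similarity, and the quantizer determines codebook size, outlier budget, and packing per layer without a Hessian, calibration data, or manual bit-width selection.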

Findings

01

Under V2, XFP reaches 138 tok/s single-stream decode on Qwen3.5-122B-A10B (RTX PRO 6000 Blackwell, TP=2).

02

At that decode speed it holds 94.49% GSM8K strict-match accuracy (3 seeds, n=3957).

03

The H-Process enables fitting large models into memory while preserving accuracy.

Abstract

We introduce XFP, a dynamic weight quantizer for LLM inference that inverts the conventional workflow: the operator specifies reconstruction quality floors on per-channel cosine similarity (one strict floor for attention and shared experts, one lazy floor for routed-expert MoE); XFP determines codebook size, outlier budget, and packing per layer automatically -- no Hessian, no calibration data, no manual bit-width selection. Each weight matrix is decomposed into a sparse fp16 outlier residual and a dense sub-byte index tensor into a per-group learned codebook. Two storage modes share one auto-select frontend and one fused decode kernel: V2 (per-channel Lloyd) and V2a (shared library of L=32 codebooks per layer). On Qwen3.5-122B-A10B under V2, XFP reaches 138 tok/s single-stream decode on workstation hardware (RTX PRO 6000 Blackwell, TP=2) at 94.49% GSM8K strict-match (3 seeds, n=3957),…
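
The workflow described in the abstract (set a cosine-similarity floor, then let the quantizer pick codebook size and outlier budget) can be illustrated with a minimal NumPy sketch for a single weight channel. This is not the paper's implementation. The function names, the magnitude-based outlier rule, the quantile initialisation, the search order, and the candidate bit-widths and outlier fractions are all assumptions made for illustration. Storing outliers as direct fp16 values is also one possible reading of "outlier residual".

```python
import numpy as np

def lloyd_1d(values, k, iters=20):
    """Plain 1-D Lloyd (k-means) iteration; returns (codebook, indices)."""
    # Assumption: initialise centroids at evenly spaced quantiles.
    codebook = np.quantile(values, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        idx = np.abs(values[:, None] - codebook[None, :]).argmin(axis=1)
        for j in range(k):
            members = values[idx == j]
            if members.size:  # leave empty clusters where they are
                codebook[j] = members.mean()
    idx = np.abs(values[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook, idx

def quantize_channel(w, bits, outlier_frac):
    """Split one channel into sparse fp16 outliers plus codebook indices."""
    n_out = int(round(outlier_frac * w.size))
    order = np.argsort(np.abs(w))
    # Assumption: outliers are the largest-magnitude entries.
    out_pos = np.sort(order[w.size - n_out:])
    dense_mask = np.ones(w.size, dtype=bool)
    dense_mask[out_pos] = False
    codebook, idx = lloyd_1d(w[dense_mask], 2 ** bits)
    w_hat = np.empty_like(w)
    w_hat[dense_mask] = codebook[idx]  # dense part: codebook lookup
    w_hat[out_pos] = w[out_pos].astype(np.float16).astype(w.dtype)  # sparse part
    return w_hat, codebook, idx, out_pos

def cosine(a, b):
    """Cosine similarity between the original and reconstructed channel."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def auto_select(w, floor, bit_options=(2, 3, 4),
                outlier_options=(0.0, 0.005, 0.01)):
    """Return the first (bits, outlier_frac, cos) in search order meeting the floor.

    If no candidate meets the floor, the last candidate tried is returned.
    """
    for bits in bit_options:
        for frac in outlier_options:
            w_hat, *_ = quantize_channel(w, bits, frac)
            cos = cosine(w, w_hat)
            if cos >= floor:
                return bits, frac, cos
    return bits, frac, cos

# Hypothetical usage: a Gaussian channel with a few injected large weights.
rng = np.random.default_rng(0)
w = rng.standard_normal(512)
w[[7, 100, 300]] = [12.0, -15.0, 9.0]
print(auto_select(w, floor=0.99))
```

A stricter floor pushes the search toward more codebook entries or a larger outlier budget. That mirrors the abstract's distinction between a strict floor for attention and shared experts and a lazier floor for routed experts, though the actual selection procedure and packing are not specified in the text above.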
