SFMP: Fine-Grained, Hardware-Friendly and Search-Free Mixed-Precision Quantization for Large Language Models

Xin Nie; Haicheng Zhang; Liang Dong; Beining Feng; Jinhong Weng; Guiling Sun

arXiv:2602.01027·cs.LG·February 3, 2026

Open Access

TL;DR

SFMP is a search-free, hardware-friendly mixed-precision quantization framework for large language models. It compresses models and supports efficient inference without expensive discrete optimization over precision allocations.

Contribution

The paper proposes four components: fractional bit-widths that turn discrete precision allocation into a continuous problem, block-wise mixed precision, row-column weight reordering, and a unified GEMM kernel.

Findings

01 Outperforms state-of-the-art layer-wise methods under the same memory constraints

02 Reduces quantization cost significantly

03 Improves inference efficiency

Abstract

Mixed-precision quantization is a promising approach for compressing large language models under tight memory budgets. However, existing mixed-precision methods typically suffer from one of two limitations: they either rely on expensive discrete optimization to determine precision allocation, or introduce hardware inefficiencies due to irregular memory layouts. We propose SFMP, a search-free and hardware-friendly mixed-precision quantization framework for large language models. The framework is built upon four novel ideas: 1) Fractional bit-width, which extends the integer bit-width of a weight matrix to fractional values and transforms discrete precision allocation into a continuous problem; 2) Block-wise mixed-precision, enabling fine-grained precision within weight matrices while remaining hardware-friendly; 3) Row-column weight reordering, which aggregates salient weights via row and column…
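The abstract is truncated here, so the following is only a rough sketch of how the first three ideas could fit together, not the authors' implementation. It assumes a per-weight salience score is already available (the abstract does not say how salience is measured), uses just two integer bit-widths, and introduces the hypothetical helpers `reorder_rows_cols` and `block_bit_allocation` for illustration. A fractional target such as 2.5 bits is met by giving a matching fraction of blocks the higher bit-width.

```python
import numpy as np


def reorder_rows_cols(salience):
    """Return row and column permutations sorted by total salience, descending.

    Hypothetical helper: sorting both axes pushes salient weights toward
    the top-left blocks of the matrix.
    """
    row_perm = np.argsort(-salience.sum(axis=1), kind="stable")
    col_perm = np.argsort(-salience.sum(axis=0), kind="stable")
    return row_perm, col_perm


def block_bit_allocation(salience, block, avg_bits, low=2, high=3):
    """Give `high` bits to the most salient blocks and `low` bits to the rest.

    Hypothetical helper: the number of high-precision blocks is chosen so
    the mean bit-width over all blocks is as close as possible to `avg_bits`.
    """
    rows, cols = salience.shape
    assert rows % block == 0 and cols % block == 0, "matrix must tile evenly"
    # Total salience inside each (block x block) tile.
    tiles = salience.reshape(rows // block, block, cols // block, block).sum(axis=(1, 3))
    n_tiles = tiles.size
    # Fraction of tiles that need `high` bits to reach the fractional target.
    n_high = int(round((avg_bits - low) / (high - low) * n_tiles))
    bits = np.full(n_tiles, low, dtype=np.int64)
    most_salient_first = np.argsort(tiles.ravel())[::-1]
    bits[most_salient_first[:n_high]] = high
    return bits.reshape(tiles.shape)


# Toy usage with random salience scores (an assumption, not real model data).
rng = np.random.default_rng(0)
salience = rng.random((8, 8))
row_perm, col_perm = reorder_rows_cols(salience)
reordered = salience[row_perm][:, col_perm]
bits = block_bit_allocation(reordered, block=4, avg_bits=2.5)
print(bits)         # 2x2 grid of per-block bit-widths
print(bits.mean())  # 2.5
```

With four blocks and a 2.5-bit target, two blocks receive 3 bits and two receive 2 bits, so the average matches the fractional budget without any search over discrete assignments. Because every block stays a regular tile with a single integer bit-width, the layout remains regular, which is the property the abstract describes as hardware-friendly.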

Taxonomy

Topics: Advanced Neural Network Applications · Natural Language Processing Techniques · Speech Recognition and Synthesis