FBQuant: FeedBack Quantization for Large Language Models

Yijiang Liu; Hengyu Fang; Liulu He; Rongyu Zhang; Yichuan Bai; Yuan Du; Li Du

arXiv:2501.16385 · cs.LG · May 26, 2025

TL;DR

FBQuant is a novel weight quantization method for large language models that reduces accuracy loss and inference latency on edge devices by incorporating feedback mechanisms and efficient CUDA kernels.

Contribution

We introduce FBQuant, a feedback-inspired quantization technique with optimized CUDA implementation, improving accuracy and efficiency for LLM deployment on edge devices.

Findings

01. Improves 3-bit Llama2-7B zero-shot accuracy by 1.2%.

02. Reduces extra inference time by 60% with an optimized CUDA kernel.

03. Effectively maintains weight bounds, reducing overfitting risk.

Abstract

Deploying Large Language Models (LLMs) on edge devices is increasingly important, as it eliminates reliance on network connections, reduces expensive API calls, and enhances user privacy. However, on-device deployment is challenging due to the limited computational resources of edge devices. In particular, the key bottleneck stems from memory bandwidth constraints related to weight loading. Weight-only quantization effectively reduces memory access, yet often induces significant accuracy degradation. Recent efforts to incorporate sub-branches have shown promise for mitigating quantization errors, but these methods either lack robust optimization strategies or rely on suboptimal objectives. To address these gaps, we propose FeedBack Quantization (FBQuant), a novel approach inspired by negative feedback mechanisms in automatic control. FBQuant inherently ensures that the reconstructed…
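To make the trade-off described in the abstract concrete, the following is a minimal sketch of generic weight-only quantization using asymmetric round-to-nearest on a single group of weights. It is not FBQuant and does not include a sub-branch or any feedback mechanism; the function names `quantize_group` and `dequantize_group`, the per-group min/max scaling, and the example weights are all assumptions chosen for illustration.

```python
def quantize_group(weights, bits=3):
    """Map one group of float weights to integer codes in [0, 2**bits - 1].

    Generic asymmetric round-to-nearest, shown only for illustration.
    Returns the codes plus the scale and offset needed to reconstruct.
    """
    levels = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    # Guard against a constant group, which would give a zero scale.
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes, scale, offset):
    """Reconstruct approximate float weights from integer codes."""
    return [c * scale + offset for c in codes]


# Hypothetical example weights; real layers hold millions of values.
example_weights = [-0.5, -0.1, 0.0, 0.2, 0.7]
example_codes, example_scale, example_offset = quantize_group(example_weights, bits=3)
example_recon = dequantize_group(example_codes, example_scale, example_offset)

# Per-weight reconstruction error is bounded by half a quantization step.
example_errors = [abs(w - r) for w, r in zip(example_weights, example_recon)]
```

Each weight is stored as a 3-bit code instead of a 16-bit float, which is where the reduction in memory traffic comes from (ignoring the small per-group overhead of the scale and offset). The entries of `example_errors` are the quantization errors that methods with sub-branches aim to compensate.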

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques