FBQuant: FeedBack Quantization for Large Language Models
Yijiang Liu, Hengyu Fang, Liulu He, Rongyu Zhang, Yichuan Bai, Yuan Du, Li Du

TL;DR
FBQuant is a novel weight quantization method for large language models that reduces accuracy loss and inference latency on edge devices by incorporating feedback mechanisms and efficient CUDA kernels.
Contribution
We introduce FBQuant, a feedback-inspired quantization technique with an optimized CUDA implementation that improves accuracy and efficiency for LLM deployment on edge devices.
Findings
Improves 3-bit Llama2-7B zero-shot accuracy by 1.2%.
Reduces the extra inference time by 60% with an efficient CUDA kernel.
Keeps reconstructed weights bounded, reducing the risk of overfitting.
Abstract
Deploying Large Language Models (LLMs) on edge devices is increasingly important, as it eliminates reliance on network connections, reduces expensive API calls, and enhances user privacy. However, on-device deployment is challenging due to the limited computational resources of edge devices. In particular, the key bottleneck stems from memory bandwidth constraints related to weight loading. Weight-only quantization effectively reduces memory access, yet often induces significant accuracy degradation. Recent efforts to incorporate sub-branches have shown promise for mitigating quantization errors, but these methods either lack robust optimization strategies or rely on suboptimal objectives. To address these gaps, we propose FeedBack Quantization (FBQuant), a novel approach inspired by negative feedback mechanisms in automatic control. FBQuant inherently ensures that the reconstructed…
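To make the trade-off in the abstract concrete, the sketch below shows plain round-to-nearest, group-wise weight-only quantization and the weight storage it saves. It is a generic baseline for illustration only, not the FBQuant method; the function names, the group size of 128, the Gaussian test weights, and the round figure of 7 billion parameters are all assumptions introduced here.

```python
import random


def quantize_group(weights, bits=3):
    """Asymmetric round-to-nearest quantization of one weight group.

    Generic baseline for illustration; this is not the FBQuant algorithm.
    """
    levels = (1 << bits) - 1  # 7 steps for 3-bit codes in [0, 7]
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes, scale, zero):
    """Reconstruct approximate weights from integer codes."""
    return [c * scale + zero for c in codes]


def weight_bytes(num_params, bits):
    """Bytes needed to store the weights alone (scales and zero points ignored)."""
    return num_params * bits / 8


random.seed(0)
# Hypothetical group of 128 weights drawn from a narrow Gaussian.
group = [random.gauss(0.0, 0.02) for _ in range(128)]

codes, scale, zero = quantize_group(group, bits=3)
recon = dequantize_group(codes, scale, zero)

# Rounding error per weight is at most half a quantization step.
max_err = max(abs(w - r) for w, r in zip(group, recon))
print("step size:", scale, "max abs error:", max_err)

# Assumed round figure of 7e9 parameters for a 7B-class model.
print("16-bit weights (GB):", weight_bytes(7_000_000_000, 16) / 1e9)
print(" 3-bit weights (GB):", weight_bytes(7_000_000_000, 3) / 1e9)
```

The per-weight rounding error is what sub-branch methods, as described in the abstract, try to compensate for, while the smaller footprint is what relieves the weight-loading bandwidth bottleneck.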
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
