Is (Selective) Round-To-Nearest Quantization All You Need?
Alex Kogan

TL;DR
This paper demonstrates that simple Round-to-Nearest quantization is a cost-effective, practical, and competitive method for LLMs, often matching or surpassing more complex techniques in throughput and accuracy.
Contribution
The work shows RTN's effectiveness for LLM quantization, introduces improvements via selective precision increases, and challenges the notion that advanced methods are always superior.
Findings
RTN can achieve similar accuracy to advanced quantization methods.
RTN offers higher token generation throughput and lower computational cost.
Selective precision enhancement improves RTN performance.
Abstract
Quantization has become a necessary tool for serving ever-larger Large Language Models (LLMs). RTN (Round-to-Nearest) is perhaps the simplest quantization technique, and it was in use well before LLMs surged to the forefront of machine learning (ML) research. Yet it has been largely dismissed by recent, more advanced quantization methods that claim superiority over RTN in nearly every aspect of performance. This work aims to dispel this established point of view, showing that RTN is not only much cheaper to apply, but that its token generation throughput can be better than, and its accuracy similar to, that of more advanced alternatives. In particular, we discuss our implementation of RTN based on the recent Marlin kernels and demonstrate how the accuracy of RTN can be gradually improved by selectively increasing the data precision format of certain model layers and modules. Based on our…
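To make the two ideas in the abstract concrete, here is a minimal sketch of round-to-nearest quantization with a per-layer precision override. It is an illustration only, not the paper's Marlin-based implementation: the symmetric per-group scaling, the 4-bit and 8-bit widths, the group size, and every name (`rtn_quantize`, `quantize_model`, `mean_sq_error`, the layer names) are assumptions chosen for the example.

```python
from typing import Dict, List


def rtn_quantize(weights: List[float], bits: int, group_size: int) -> List[float]:
    """Quantize with symmetric round-to-nearest, then dequantize.

    Each run of `group_size` consecutive weights shares one scale
    (an assumption of this sketch, not a detail taken from the paper).
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    out: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        for w in group:
            # Python's round() breaks exact ties toward the even integer;
            # it is still a round-to-nearest rule.
            q = round(w / scale)
            q = max(-qmax, min(qmax, q))  # clamp to the symmetric integer range
            out.append(q * scale)
    return out


def mean_sq_error(a: List[float], b: List[float]) -> float:
    """Mean squared difference between two equal-length weight lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def quantize_model(
    layers: Dict[str, List[float]],
    default_bits: int,
    overrides: Dict[str, int],
    group_size: int,
) -> Dict[str, List[float]]:
    """Apply RTN to every layer, using a higher bit width where overridden."""
    return {
        name: rtn_quantize(w, overrides.get(name, default_bits), group_size)
        for name, w in layers.items()
    }


if __name__ == "__main__":
    # Hypothetical toy "model" with two layers of four weights each.
    layers = {
        "attn": [0.1, -0.5, 0.33, 0.9],
        "mlp": [0.02, 0.4, -0.75, 0.6],
    }
    uniform = quantize_model(layers, default_bits=4, overrides={}, group_size=4)
    selective = quantize_model(layers, default_bits=4, overrides={"attn": 8}, group_size=4)
    for name in layers:
        print(name,
              mean_sq_error(layers[name], uniform[name]),
              mean_sq_error(layers[name], selective[name]))
```

Running the script prints the reconstruction error of each layer under uniform 4-bit RTN and under the selective configuration. Only the overridden layer changes, which is the trade-off the abstract describes: extra bits are spent only on the layers and modules where they are expected to help accuracy.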
Taxonomy
Topics: Machine Learning and Data Classification · Natural Language Processing Techniques · Big Data and Digital Economy
Methods: MARLIN
