Accuracy is Not All You Need
Abhinav Dutta, Sanjeev Krishnan, Nipun Kwatra, Ramachandran Ramjee

TL;DR
This paper shows that accuracy alone is insufficient for evaluating compressed LLMs: a compressed model can match the baseline's accuracy while differing from it qualitatively and flipping many individual answers. The authors advocate reporting distance metrics alongside accuracy.
Contribution
The paper argues that distance metrics such as KL-Divergence and flips should be used alongside accuracy when evaluating compressed models.
Findings
Compressed models show many answer flips despite similar accuracy.
Qualitative evaluation reveals significant differences from baseline models.
KL-Divergence correlates well with answer flips.
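The findings above can be illustrated with a small sketch. It is not the authors' code: the function names, the toy answer lists, and the toy probability vectors are all hypothetical, and the flip metric is assumed to mean the fraction of questions whose correctness changes (correct to incorrect, or incorrect to correct) between the baseline and the compressed model.

```python
import math

def accuracy(preds, gold):
    """Fraction of predictions that match the gold answers."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

def flip_rate(base_preds, comp_preds, gold):
    """Fraction of questions whose correctness differs between the two models.

    Assumed definition: a flip is counted when exactly one of the two
    models answers the question correctly.
    """
    flips = sum((b == g) != (c == g)
                for b, c, g in zip(base_preds, comp_preds, gold))
    return flips / len(gold)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same answer options."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)

# Toy multiple-choice data (hypothetical, for illustration only).
gold = ["A", "B", "C", "D", "A", "B", "C", "D"]
base = ["A", "B", "C", "D", "B", "C", "D", "A"]  # baseline: first 4 correct
comp = ["B", "C", "C", "D", "A", "B", "D", "A"]  # compressed: middle 4 correct

base_acc = accuracy(base, gold)
comp_acc = accuracy(comp, gold)
flips = flip_rate(base, comp, gold)

# Toy answer distributions for a single question (hypothetical).
p_base = [0.7, 0.1, 0.1, 0.1]
p_comp = [0.4, 0.3, 0.2, 0.1]
kl = kl_divergence(p_base, p_comp)

print(f"baseline acc={base_acc:.2f}, compressed acc={comp_acc:.2f}, "
      f"flips={flips:.2f}, KL={kl:.3f}")
```

In this toy case both models score the same accuracy, yet half of the answers change correctness, which is the kind of divergence that an accuracy comparison alone would hide.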
Abstract
When Large Language Models (LLMs) are compressed using techniques such as quantization, the predominant way to demonstrate the validity of such techniques is by measuring the model's accuracy on various benchmarks. If the accuracies of the baseline model and the compressed model are close, it is assumed that there was negligible degradation in quality. However, even when the accuracy of baseline and compressed model are similar, we observe the phenomenon of flips, wherein answers change from correct to incorrect and vice versa in proportion. We conduct a detailed study of metrics across multiple compression techniques, models and datasets, demonstrating that the behavior of compressed models as visible to end-users is often significantly different from the baseline model, even when accuracy is similar. We further evaluate compressed models qualitatively and quantitatively using MT-Bench and…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Algorithms
