How Does Quantization Affect Multilingual LLMs?
Kelly Marchisio, Saurabh Dash, Hongyu Chen, Dennis Aumiller, Ahmet Üstün, Sara Hooker, Sebastian Ruder

TL;DR
This paper investigates how quantization impacts the performance of multilingual large language models across different languages and tasks, revealing significant discrepancies between automatic metrics and human evaluations.
Contribution
It provides a comprehensive analysis of quantization effects on multilingual LLMs, highlighting language disparities and the importance of human evaluation for accurate assessment.
Findings
Quantization causes larger performance drops in non-Latin script languages than in Latin-script ones.
Automatic metrics underestimate the true impact of quantization compared to human judgments.
Challenging tasks like mathematical reasoning are most affected by quantization.
Abstract
Quantization techniques are widely used to improve inference speed and deployment of large language models. While a wide body of work examines the impact of quantization on LLMs in English, none have evaluated across languages. We conduct a thorough analysis of quantized multilingual LLMs, focusing on performance across languages and at varying scales. We use automatic benchmarks, LLM-as-a-Judge, and human evaluation, finding that (1) harmful effects of quantization are apparent in human evaluation, which automatic metrics severely underestimate: a 1.7% average drop in Japanese across automatic tasks corresponds to a 16.0% drop reported by human evaluators on realistic prompts; (2) languages are disparately affected by quantization, with non-Latin script languages impacted worst; and (3) challenging tasks like mathematical reasoning degrade fastest. As the ability to serve low-compute…
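The percentage drops quoted in the abstract can be read as changes relative to an unquantized baseline, averaged over tasks. The sketch below illustrates that arithmetic under this assumption; it is not the authors' evaluation code. The function names, task names, and scores are all hypothetical and made up for illustration.

```python
def relative_change(baseline: float, quantized: float) -> float:
    """Percent change of a quantized score relative to the baseline score."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return 100.0 * (quantized - baseline) / baseline


def mean_relative_change(baseline_scores: dict, quantized_scores: dict) -> float:
    """Average the per-task percent change over tasks present in both runs."""
    tasks = sorted(baseline_scores.keys() & quantized_scores.keys())
    if not tasks:
        raise ValueError("no shared tasks to compare")
    changes = [relative_change(baseline_scores[t], quantized_scores[t]) for t in tasks]
    return sum(changes) / len(changes)


# Hypothetical scores for one language (NOT taken from the paper).
baseline = {"task_a": 50.0, "task_b": 80.0}
quantized = {"task_a": 49.0, "task_b": 78.0}

print(mean_relative_change(baseline, quantized))  # -2.25
```

Averaging per-task relative changes, rather than comparing raw score averages, keeps a task with a high absolute score from dominating the result; whether the paper aggregates in exactly this way is an assumption here.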
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification
