TL;DR
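This paper introduces a framework with three interpretable metrics (the performance realisation ratio, its coefficient of variation, and language potential) for quantifying language performance disparities in multilingual large language models. It separates confounding factors in multilingual evaluation and shows that higher overall performance does not guarantee fairness across languages.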
Contribution
It proposes an interpretable framework that disentangles confounding factors (target languages, experimental setups, and model choices) and defines three metrics for assessing language disparities in multilingual models more accurately.
Findings
The framework provides more reliable measurements of model and language disparities.
Higher overall model performance does not guarantee increased fairness across languages.
The gain in reliability is most pronounced for low-resource languages, which have so far proven challenging to evaluate.
Abstract
Results reported in large-scale multilingual evaluations are often fragmented and confounded by factors such as target languages, differences in experimental setups, and model choices. We propose a framework that disentangles these confounding variables and introduces three interpretable metrics--the performance realisation ratio, its coefficient of variation, and language potential--enabling a finer-grained and more insightful quantification of actual performance disparities across both (i) models and (ii) languages. Through a case study of 13 model variants on 11 multilingual datasets, we demonstrate that our framework provides a more reliable measurement of model performance and language disparities, particularly for low-resource languages, which have so far proven challenging to evaluate. Importantly, our results reveal that higher overall model performance does not necessarily…
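To make the second metric concrete, here is a minimal sketch of a coefficient of variation computed over per-language ratios. The abstract does not define the performance realisation ratio, so the ratios are taken as given inputs. The language codes, numbers, and the helper name `coefficient_of_variation` are hypothetical, and the use of the population standard deviation is an assumption rather than the paper's stated choice.

```python
import statistics


def coefficient_of_variation(values):
    """Return the population standard deviation divided by the mean."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("mean is zero; coefficient of variation is undefined")
    return statistics.pstdev(values) / mean


# Hypothetical per-language ratios for two models (illustrative numbers only,
# not taken from the paper).
ratios_model_a = {"en": 0.95, "de": 0.90, "sw": 0.45, "yo": 0.40}
ratios_model_b = {"en": 0.70, "de": 0.68, "sw": 0.62, "yo": 0.60}

mean_a = statistics.fmean(ratios_model_a.values())
mean_b = statistics.fmean(ratios_model_b.values())
cv_a = coefficient_of_variation(list(ratios_model_a.values()))
cv_b = coefficient_of_variation(list(ratios_model_b.values()))

print(f"model A: mean={mean_a:.3f}, CV={cv_a:.3f}")
print(f"model B: mean={mean_b:.3f}, CV={cv_b:.3f}")
```

In this toy setup, model A has the higher mean ratio but also the larger spread across languages. This is the kind of pattern the abstract describes, where higher overall performance does not necessarily mean smaller disparities between languages.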