Average Is Not Enough: Caveats of Multilingual Evaluation
Matúš Pikuliak, Marián Šimko

TL;DR
This paper highlights the limitations of using average performance metrics in multilingual evaluation, emphasizing the need for qualitative linguistic analysis and visualization to detect biases favoring dominant language families.
Contribution
It introduces a methodology combining qualitative linguistic analysis with visualization tools to identify biases in multilingual evaluation results.
Findings
Published results can be linguistically biased.
Visualization based on the URIEL typological database can detect such evaluation biases.
Qualitative analysis is essential for fair multilingual assessment.
Abstract
This position paper discusses the problem of multilingual evaluation. Using simple statistics, such as average language performance, might inject linguistic biases in favor of dominant language families into the evaluation methodology. We argue that a qualitative analysis informed by comparative linguistics is needed for multilingual results to detect this kind of bias. We show in our case study that results in published works can indeed be linguistically biased, and we demonstrate that visualization based on the URIEL typological database can detect it.
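To make the concern about averages concrete, here is a minimal sketch of how a plain per-language average can hide a family-level imbalance. It is not the authors' method: the scores are invented for illustration, the family labels are hand-assigned stand-ins rather than URIEL typological features, and the names `SCORES`, `gain_per_language`, and `gain_per_family` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data, for illustration only:
# language -> (language family, baseline score, new-method score)
SCORES = {
    "en": ("Indo-European", 80.0, 84.0),
    "de": ("Indo-European", 78.0, 82.0),
    "es": ("Indo-European", 79.0, 83.0),
    "ru": ("Indo-European", 75.0, 79.0),
    "fi": ("Uralic", 70.0, 66.0),
    "tr": ("Turkic", 68.0, 64.0),
}


def gain_per_language(scores):
    """Return the score difference (new - baseline) for every language."""
    return {lang: new - base for lang, (_, base, new) in scores.items()}


def gain_per_family(scores):
    """Return the mean score difference within each language family."""
    grouped = defaultdict(list)
    for family, base, new in scores.values():
        grouped[family].append(new - base)
    return {family: mean(gains) for family, gains in grouped.items()}


if __name__ == "__main__":
    lang_gains = gain_per_language(SCORES)
    family_gains = gain_per_family(SCORES)

    # Plain average over languages: dominated by the over-represented family.
    print("average gain over languages:", mean(lang_gains.values()))
    # Average over families: each family counts once.
    print("average gain over families: ", mean(family_gains.values()))
    for family, gain in family_gains.items():
        print(f"  {family}: {gain:+.2f}")
```

With these made-up numbers the average over languages is positive, because four of the six languages belong to one family, while the average over families is negative. Grouping by family is only a coarse proxy for the typological comparison the paper argues for, but it shows why a single mean should not be reported alone.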
Taxonomy
Topics: Natural Language Processing Techniques
