Evaluating Large Language Models along Dimensions of Language Variation: A Systematik Invesdigatiom uv Cross-lingual Generalization
Niyati Bafna, Kenton Murray, David Yarowsky

TL;DR
This paper investigates how linguistic differences affect the performance of large language models on related languages and dialects, using a novel Bayesian noise model to synthesize artificial languages and analyze cross-lingual generalization.
Contribution
The paper introduces a Bayesian noise-based framework for systematically studying how linguistic distance affects model performance, which helps in understanding and mitigating performance degradation on closely-related languages.
Findings
Model robustness varies with phonological, morphological, and lexical distance.
Artificial language experiments align with real language data trends.
Framework allows estimation of unseen language performance from high-resource language data.
Abstract
While large language models exhibit certain cross-lingual generalization capabilities, they suffer from performance degradation (PD) on unseen closely-related languages (CRLs) and dialects relative to their high-resource language neighbour (HRLN). However, we currently lack a fundamental understanding of what kinds of linguistic distances contribute to PD, and to what extent. Furthermore, studies of cross-lingual generalization are confounded by unknown quantities of CRL language traces in the training data, and by the frequent lack of availability of evaluation data in lower-resource related languages and dialects. To address these issues, we model phonological, morphological, and lexical distance as Bayesian noise processes to synthesize artificial languages that are controllably distant from the HRLN. We analyse PD as a function of underlying noise parameters, offering insights on…
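To make the idea of a controllable noise process concrete, here is a minimal Python sketch. It is not the paper's noiser: the function names, the two parameters `theta_word` and `theta_char`, and the vowel/consonant substitution rule are all assumptions chosen for illustration. Each word is selected for noising with probability `theta_word`, and each letter of a selected word is swapped for another letter of the same class with probability `theta_char`, as a crude stand-in for phonological noise.

```python
import random

VOWELS = "aeiou"
CONSONANTS = "bcdfghjklmnpqrstvwxyz"


def noise_chars(word, theta_char, rng):
    """Swap each letter for a different letter of the same class
    (vowel or consonant) with probability theta_char.
    Hypothetical simplification, not the paper's phonological model."""
    out = []
    for ch in word:
        lower = ch.lower()
        if lower in VOWELS and rng.random() < theta_char:
            new = rng.choice(VOWELS.replace(lower, ""))
        elif lower in CONSONANTS and rng.random() < theta_char:
            new = rng.choice(CONSONANTS.replace(lower, ""))
        else:
            # Non-letters and letters that were not sampled stay unchanged.
            out.append(ch)
            continue
        out.append(new.upper() if ch.isupper() else new)
    return "".join(out)


def noise_sentence(sentence, theta_word, theta_char, seed=0):
    """Noise each whitespace-separated word with probability theta_word.
    Both parameters are illustrative names, not taken from the paper."""
    rng = random.Random(seed)
    return " ".join(
        noise_chars(w, theta_char, rng) if rng.random() < theta_word else w
        for w in sentence.split()
    )


if __name__ == "__main__":
    source = "a systematic investigation of cross-lingual generalization"
    for theta in (0.0, 0.2, 0.5):
        print(theta, noise_sentence(source, theta_word=1.0, theta_char=theta))
```

Sweeping the parameters from 0 upward yields text that is increasingly distant from the source sentence. In this simplified form, that is the kind of knob against which performance degradation could be plotted.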
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
