Cross-Lingual Stability of LLM Judges Under Controlled Generation: Evidence from Finno-Ugric Languages
Isaac Chung, Linda Freienthal

TL;DR
This study tests whether LLM evaluation results stay stable across three Finno-Ugric languages (Estonian, Finnish, and Hungarian). Surface metrics are stable, but pragmatic judgments vary, which points to the need for language-specific calibration.
Contribution
The paper introduces a controlled evaluation framework for diagnosing cross-lingual instability in LLM judges, and it emphasizes that zero-shot transfer is unreliable for discourse-level assessments.
Findings
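Surface metrics (lexical diversity, surface and semantic similarity) are stable across languages.
Pragmatic judgments (coherence, instruction-following) show rank inversions and near-zero correlations across languages.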
Evaluation instability reflects judge scoring behavior, not true model differences.
Abstract
Cross-lingual evaluation of large language models (LLMs) typically conflates two sources of variance: genuine model performance differences and measurement instability. We investigate evaluation reliability by holding generation conditions constant while varying target language. Using synthetic customer-support dialogues generated with identical parameters across Estonian, Finnish, and Hungarian, we test whether automatic metrics and LLM-as-a-judge scoring produce stable model rankings across these morphologically rich, related Finno-Ugric languages. With a small set of Estonian native speaker annotations as a reference point, we find systematic ranking instabilities: surface-level metrics (lexical diversity, surface and semantic similarity) maintain cross-language stability, but pragmatic judgments (coherence, instruction-following) exhibit rank inversions and near-zero correlations.…
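The ranking-stability check described in the abstract can be illustrated with a minimal sketch. It assumes Spearman rank correlation as the stability measure, which is an assumption here because the excerpt does not name the statistic used. The model names, language codes, and judge scores below are invented for illustration and are not the paper's data.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one list is constant
    return cov / (var_x * var_y) ** 0.5


def rank_inversions(x, y):
    """Count model pairs whose relative order flips between two score lists."""
    return sum(
        1
        for i, j in combinations(range(len(x)), 2)
        if (x[i] - x[j]) * (y[i] - y[j]) < 0
    )


# Hypothetical mean judge scores per model, one list per target language.
models = ["model_a", "model_b", "model_c", "model_d"]
scores = {
    "et": [4.1, 3.8, 3.2, 2.9],
    "fi": [3.0, 3.9, 4.2, 3.1],
    "hu": [4.0, 3.7, 3.3, 2.8],
}

for lang_a, lang_b in combinations(scores, 2):
    rho = spearman(scores[lang_a], scores[lang_b])
    flips = rank_inversions(scores[lang_a], scores[lang_b])
    print(f"{lang_a} vs {lang_b}: rho={rho:.2f}, inversions={flips}")
```

With a stable metric, every language pair would show a correlation near 1 and few or no inversions. A pair with low correlation and many flipped model pairs corresponds to the kind of instability the abstract reports for pragmatic judgments.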
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
