Cross-Lingual Stability of LLM Judges Under Controlled Generation: Evidence from Finno-Ugric Languages

Isaac Chung; Linda Freienthal

arXiv:2602.02287·cs.CL·February 3, 2026

TL;DR

This study tests whether large language model evaluations remain stable across three Finno-Ugric languages (Estonian, Finnish, and Hungarian). Surface metrics are stable across languages, but pragmatic judgments vary, which points to a need for language-specific calibration.

Contribution

The paper introduces a controlled evaluation framework for diagnosing cross-lingual stability issues in LLM judges, and it emphasizes that zero-shot transfer is unreliable for discourse-level assessments.

Findings

01 Surface metrics are stable across languages.

02 Pragmatic judgments show rank inversions across languages.

03 Evaluation instability reflects judge scoring behavior, not true model differences.

Abstract

Cross-lingual evaluation of large language models (LLMs) typically conflates two sources of variance: genuine model performance differences and measurement instability. We investigate evaluation reliability by holding generation conditions constant while varying target language. Using synthetic customer-support dialogues generated with identical parameters across Estonian, Finnish, and Hungarian, we test whether automatic metrics and LLM-as-a-judge scoring produce stable model rankings across these morphologically rich, related Finno-Ugric languages. With a small set of Estonian native speaker annotations as a reference point, we find systematic ranking instabilities: surface-level metrics (lexical diversity, surface and semantic similarity) maintain cross-language stability, but pragmatic judgments (coherence, instruction-following) exhibit rank inversions and near-zero correlations.…
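
The abstract describes two symptoms of instability: rank inversions and near-zero correlations between languages. A minimal sketch of how such a check could be computed is shown below. It assumes Spearman rank correlation as the statistic (this excerpt does not say which correlation the authors use), and every model name and score in it is a hypothetical placeholder, not data from the paper.

```python
from itertools import combinations


def ranks(scores):
    """Return average ranks (1 = lowest score); tied scores share a rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def rank_inversions(x, y):
    """Index pairs that x orders one way and y orders the opposite way."""
    return [
        (i, j)
        for i, j in combinations(range(len(x)), 2)
        if (x[i] - x[j]) * (y[i] - y[j]) < 0
    ]


# Hypothetical judge scores for four systems (placeholders, not paper data).
models = ["model_a", "model_b", "model_c", "model_d"]
coherence_estonian = [4.1, 3.8, 3.2, 2.9]
coherence_hungarian = [3.0, 3.9, 3.1, 3.6]

rho = spearman(coherence_estonian, coherence_hungarian)
flipped = rank_inversions(coherence_estonian, coherence_hungarian)
print(f"Spearman rho: {rho:.2f}")
for i, j in flipped:
    print(f"inverted pair: {models[i]} vs {models[j]}")
```

Under this reading, a metric that is stable across languages would give a correlation near 1 and no inverted pairs, while an unstable one would give a correlation near zero (or negative) and several inverted pairs.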

Code & Models

Datasets

isaacchung/controlled-generated-convos-gpt-4.1-mini
dataset · 17 dl

Videos

Cross-Lingual Stability of LLM Judges Under Controlled Generation: Evidence from Finno-Ugric Languages · underline

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis