Cross-Lingual Response Consistency in Large Language Models: An ILR-Informed Evaluation of Claude Across Six Languages
Camelia Baluta

TL;DR
This study evaluates Claude's responses across six languages (English, French, Romanian, Spanish, Italian, and German) against the Interagency Language Roundtable (ILR) Skill Level Descriptions. Combined automated and expert assessments reveal cross-lingual variation in response length, style, and cultural content.
Contribution
The paper introduces an ILR-based evaluation framework for multilingual LLMs that combines automated quantitative metrics with expert qualitative analysis to characterize cross-lingual response differences.
Findings
French responses are approximately 30% longer than German responses on identical prompts.
Creative and affective prompt clusters show the highest cross-lingual surface divergence.
Expert analysis identified five patterns of cross-lingual variation.
Abstract
This paper introduces a systematic evaluation framework grounded in the Interagency Language Roundtable (ILR) Skill Level Descriptions and applies it to Claude (Sonnet 4.6) across six languages: English, French, Romanian, Spanish, Italian, and German. We administer a battery of 12 semantically equivalent prompt clusters spanning ILR complexity levels 1 through 3+, collect 216 responses (12 prompts, 6 languages, 3 runs), and analyze outputs through a two-layer methodology combining automated quantitative metrics with expert ILR qualitative assessment. Quantitative analysis reveals that French responses are approximately 30% longer than German responses on identical prompts, and that creative and affective clusters show the highest cross-lingual surface divergence. Qualitative analysis, conducted by a six-language professional with 12 years of ILR/OPI assessment experience, identifies…
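The sampling design and the headline length comparison described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the abstract does not say whether length is measured in words, tokens, or characters, and the numbers passed in the usage note are placeholders rather than results from the paper.

```python
from itertools import product
from statistics import mean

# Design parameters taken from the abstract.
LANGUAGES = ["English", "French", "Romanian", "Spanish", "Italian", "German"]
N_CLUSTERS = 12  # semantically equivalent prompt clusters
N_RUNS = 3       # repeated runs per (cluster, language) pair


def design_grid():
    """Enumerate every (cluster, language, run) cell of the study design."""
    return list(product(range(1, N_CLUSTERS + 1), LANGUAGES, range(1, N_RUNS + 1)))


def relative_length_gap(lengths_a, lengths_b):
    """Percent by which the mean length of A exceeds the mean length of B.

    Assumption: lengths are plain counts in a single, consistent unit.
    """
    return (mean(lengths_a) / mean(lengths_b) - 1.0) * 100.0


grid = design_grid()
```

Here `len(grid)` equals 12 × 6 × 3 = 216, matching the number of collected responses. Calling `relative_length_gap([130, 130], [100, 100])` with these placeholder values yields a gap of about 30%, which is the form of the French-versus-German comparison reported above.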
