Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation
Alhasan Mahmood, Samir Abdaljalil, Hasan Kurban

TL;DR
This paper shows that the language used for evaluation can change, and even invert, backbone rankings on agentic code benchmarks, which argues for multilingual and localized evaluation.
Contribution
It introduces multilingual prompt localization for agent-as-a-judge benchmarks and shows that judge backbone and evaluation language interact to affect evaluation outcomes.
Findings
Backbone rankings invert across languages.
No single backbone dominates in all languages.
Localization of judge instructions impacts satisfaction scores.
Abstract
Evaluation language is typically treated as a fixed English default in agentic code benchmarks, yet we show that changing the judge's language can invert backbone rankings. We localize the Agent-as-a-Judge prompt stack to five typologically diverse languages (English, Arabic, Turkish, Chinese, Hindi) and evaluate 55 DevAI development tasks across three developer-agent frameworks and six judge backbones, totaling 4950 judge runs. The central finding is that backbone and language interact: GPT-4o achieves the highest satisfaction in English (44.72%), while Gemini leads in Arabic (51.72%, ahead of GPT-4o) and Hindi (53.22%). No single backbone dominates across all languages, and inter-backbone agreement on individual requirement judgments, measured by Fleiss' κ, is modest. A controlled ablation further shows that localizing judge-side instructions, not just benchmark…
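To make the two quantities in the abstract concrete, the following is a minimal sketch, not the authors' code. It checks the stated run count (tasks × frameworks × backbones × languages) and computes Fleiss' kappa using the standard textbook formula. The names `verdicts_to_counts` and `fleiss_kappa`, and the assumption that each requirement verdict is a binary satisfied/unsatisfied label from each of the six backbones, are illustrative choices rather than details confirmed by the paper.

```python
from typing import List, Sequence

# Design size stated in the abstract: tasks x frameworks x backbones x languages.
N_TASKS, N_FRAMEWORKS, N_BACKBONES, N_LANGUAGES = 55, 3, 6, 5
TOTAL_JUDGE_RUNS = N_TASKS * N_FRAMEWORKS * N_BACKBONES * N_LANGUAGES  # 4950


def verdicts_to_counts(verdicts: Sequence[Sequence[bool]]) -> List[List[int]]:
    """Hypothetical helper: one row per requirement, one boolean per backbone.

    Returns [n_unsatisfied, n_satisfied] counts for each requirement.
    """
    return [
        [sum(1 for v in row if not v), sum(1 for v in row if v)]
        for row in verdicts
    ]


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table of category counts (items x categories).

    Every item must be rated by the same number of raters (at least two).
    """
    if not counts:
        raise ValueError("need at least one rated item")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Mean observed agreement across items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Chance agreement from the marginal category proportions.
    n_categories = len(counts[0])
    p_e = sum(
        (sum(row[j] for row in counts) / (n_items * n_raters)) ** 2
        for j in range(n_categories)
    )
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)
```

With six judges, a requirement on which all six agree contributes an observed agreement of 1, while a 3-versus-3 split contributes 0.4, so kappa falls quickly as backbones disagree on individual requirements.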
