Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs

Sumanth Doddapaneni; Mohammed Safi Ur Rahman Khan; Dilip Venkatesh; Raj Dabre; Anoop Kunchukuttan; Mitesh M. Khapra

arXiv:2410.13394 · cs.CL · July 21, 2025


TL;DR

This paper introduces the CIA Suite, a framework for evaluating multilingual LLMs that combines a new test set, Recon, with a cross-lingual evaluator model, Hercule, whose scores align well with human judgments across multiple languages.

Contribution

The paper presents Hercule, a novel cross-lingual evaluation model, and the Recon test set, enabling scalable, accurate multilingual LLM assessment and benchmarking.

Findings

01 Hercule outperforms proprietary models in aligning with human judgments.

02 Hercule is effective in zero-shot evaluation for unseen languages.

03 The CIA Suite facilitates comprehensive multilingual LLM benchmarking.

Abstract

Evaluating machine-generated text remains a significant challenge in NLP, especially for non-English languages. Current methodologies, including automated metrics, human assessments, and LLM-based evaluations, predominantly focus on English, revealing a significant gap in multilingual evaluation frameworks. We introduce the Cross Lingual Auto Evaluation (CIA) Suite, an extensible framework that includes evaluator LLMs (Hercule) and a novel test set (Recon) specifically designed for multilingual evaluation. Our test set features 500 human-annotated instructions spanning various task capabilities along with human judgment scores across six languages. This would enable benchmarking of general-purpose multilingual LLMs and facilitate meta-evaluation of Evaluator LLMs. The proposed model, Hercule, is a cross-lingual evaluation model that addresses the scarcity of reference answers in the…
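The meta-evaluation mentioned in the abstract compares an evaluator LLM's scores with human judgment scores on the same responses. A minimal sketch of one such agreement measure follows. It assumes integer scores on a 1 to 5 scale and uses linear weighted Cohen's kappa; neither the scale nor the metric is stated in the text above, and the function name `linear_weighted_kappa` is hypothetical.

```python
from collections import Counter


def linear_weighted_kappa(human, evaluator, min_score=1, max_score=5):
    """Linear weighted Cohen's kappa between two lists of integer scores.

    Assumption: scores are integers in [min_score, max_score]. The 1-5
    default is illustrative and not taken from the paper's text.
    """
    if len(human) != len(evaluator) or not human:
        raise ValueError("score lists must be non-empty and of equal length")

    n = len(human)
    span = max_score - min_score  # normalises |a - b| into [0, 1]

    # Observed disagreement: mean normalised distance between paired scores.
    observed = sum(abs(h - e) for h, e in zip(human, evaluator)) / (n * span)

    # Expected disagreement if the two raters scored independently,
    # computed from each rater's marginal score distribution.
    h_counts = Counter(human)
    e_counts = Counter(evaluator)
    expected = sum(
        h_counts[a] * e_counts[b] * abs(a - b)
        for a in h_counts
        for b in e_counts
    ) / (n * n * span)

    if expected == 0:
        # Both raters gave one identical score to every item; kappa is
        # formally undefined here, so report perfect agreement.
        return 1.0
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Made-up scores for illustration only, not data from the paper.
    human_scores = [5, 3, 4, 1, 2]
    evaluator_scores = [4, 3, 4, 2, 2]
    print(linear_weighted_kappa(human_scores, evaluator_scores))
```

A value of 1 means the evaluator reproduces the human scores exactly, 0 means agreement no better than chance, and negative values mean systematic disagreement. The linear weights penalise a miss by two points twice as much as a miss by one, which suits ordinal rubric scores better than plain accuracy.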


Code & Models

Repositories

ai4bharat/cia (PyTorch, official)


Taxonomy

Topics: Natural Language Processing Techniques · Linguistics and Terminology Studies · Translation Studies and Practices

Methods: Focus · Sparse Evolutionary Training