Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs
Sumanth Doddapaneni, Mohammed Safi Ur Rahman Khan, Dilip Venkatesh, Raj Dabre, Anoop Kunchukuttan, Mitesh M. Khapra

TL;DR
This paper introduces the CIA Suite, a framework for evaluating multilingual LLMs that pairs a new test set, Recon, with a cross-lingual evaluator model, Hercule, which aligns well with human judgments across multiple languages.
Contribution
The paper presents Hercule, a novel cross-lingual evaluation model, and the Recon test set, enabling scalable, accurate multilingual LLM assessment and benchmarking.
Findings
Hercule outperforms proprietary models in aligning with human judgments.
Hercule is effective in zero-shot evaluation for unseen languages.
The CIA Suite facilitates comprehensive multilingual LLM benchmarking.
Abstract
Evaluating machine-generated text remains a significant challenge in NLP, especially for non-English languages. Current methodologies, including automated metrics, human assessments, and LLM-based evaluations, predominantly focus on English, revealing a significant gap in multilingual evaluation frameworks. We introduce the Cross Lingual Auto Evaluation (CIA) Suite, an extensible framework that includes evaluator LLMs (Hercule) and a novel test set (Recon) specifically designed for multilingual evaluation. Our test set features 500 human-annotated instructions spanning various task capabilities along with human judgment scores across six languages. This would enable benchmarking of general-purpose multilingual LLMs and facilitate meta-evaluation of Evaluator LLMs. The proposed model, Hercule, is a cross-lingual evaluation model that addresses the scarcity of reference answers in the…
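Meta-evaluation of an evaluator LLM, as the abstract describes it, comes down to comparing the evaluator's scores with human judgment scores on the same responses. The sketch below shows one common way to do that. It is an illustration under stated assumptions, not the paper's method: the agreement metric (linear weighted Cohen's kappa), the 1 to 5 score range, and the function name `linear_weighted_kappa` are all assumptions, and the scores in the usage example are toy values.

```python
from collections import Counter
from typing import Sequence


def linear_weighted_kappa(
    human: Sequence[int],
    model: Sequence[int],
    min_score: int = 1,  # assumed score range, not stated on this page
    max_score: int = 5,
) -> float:
    """Linear weighted Cohen's kappa between two lists of integer scores."""
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    if max_score <= min_score:
        raise ValueError("max_score must be greater than min_score")

    n = len(human)
    span = max_score - min_score  # largest possible disagreement

    # Observed disagreement: mean normalised distance between paired scores.
    observed = sum(abs(h - m) for h, m in zip(human, model)) / (n * span)

    # Expected disagreement if the two raters scored independently
    # while keeping their own score distributions.
    h_counts, m_counts = Counter(human), Counter(model)
    expected = sum(
        h_counts[a] * m_counts[b] * abs(a - b)
        for a in h_counts
        for b in m_counts
    ) / (n * n * span)

    if expected == 0:
        # Degenerate case: both raters gave the same single score everywhere.
        # Treated as perfect agreement in this sketch.
        return 1.0
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Toy scores for illustration only; these are not results from the paper.
    human_scores = [1, 2, 3, 4, 5]
    evaluator_scores = [1, 2, 3, 4, 4]
    print(f"kappa = {linear_weighted_kappa(human_scores, evaluator_scores):.3f}")
```

A value of 1 means the evaluator matches the human scores exactly, 0 means agreement no better than chance, and negative values mean systematic disagreement. The linear weighting penalises a 1-versus-5 mismatch more heavily than a 4-versus-5 mismatch, which suits ordinal rubric scores.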
Code & Models
- 🤗 ai4bharat/hercule-bn (model) · 8 dl · ♡ 1
- 🤗 ai4bharat/hercule-de (model) · 4 dl
- 🤗 ai4bharat/hercule-fr (model) · 14 dl
- 🤗 ai4bharat/hercule-hi (model) · 7 dl
- 🤗 ai4bharat/hercule-te (model) · 5 dl
- 🤗 ai4bharat/hercule-ur (model) · 2 dl
- 🤗 ai4bharat/hercule-hi-lora (model) · 1 dl · ♡ 1
- 🤗 ai4bharat/hercule-te-lora (model) · 3 dl
- 🤗 ai4bharat/hercule-ur-lora (model) · 7 dl · ♡ 1
- 🤗 ai4bharat/hercule-bn-lora (model) · 4 dl
Taxonomy
Topics: Natural Language Processing Techniques · Linguistics and Terminology Studies · Translation Studies and Practices
Methods: Focus · Sparse Evolutionary Training
