Improving Methodologies for LLM Evaluations Across Global Languages

Akriti Vij; Benjamin Chua; Darshini Ramiah; En Qi Ng; Mahran Morsidi; Naga Nikshith Gangarapu; Sharmini Johnson; Vanessa Wilfred; Vikneswaran Kumaran; Wan Sie Lee; Wenzhuo Yang; Yongsen Zheng; Bill Black; Boming Xia; Frank Sun; Hao Zhang; Qinghua Lu; Suyu Ma; Yue Liu; Chi-kiu Lo; Fatemeh Azadi; Isar Nejadgholi; Sowmya Vajjala; Agnes Delaborde; Nicolas Rolin; Tom Seimandi; Akiko Murakami; Haruto Ishi; Satoshi Sekine; Takayuki Semitsu; Tasuku Sasaki; Angela Kinuthia; Jean Wangari; Michael Michie; Stephanie Kasaon; Hankyul Baek; Jaewon Noh; Kihyuk Nam; Sang Seo; Sungpil Shin; Taewhi Lee; Yongsu Kim; Daisy Newbold-Harrop; Jessica Wang; Mahmoud Ghanem; Vy Hong

arXiv:2601.15706·cs.AI·January 23, 2026

Open Access

TL;DR

This paper presents a multilingual safety evaluation of two open-weight AI models across ten languages, revealing variability in safety behaviour between languages and proposing methodological improvements for more reliable global AI safety assessments.

Contribution

It introduces a collaborative multilingual safety evaluation framework and highlights the importance of culturally contextualized translations and standardized annotation methods.

Findings

01 Safety behaviors vary significantly across languages.

02 Evaluator reliability differs between LLMs and humans.

03 Methodological insights improve multilingual safety testing.

Abstract

As frontier AI models are deployed globally, it is essential that their behaviour remains safe and reliable across diverse linguistic and cultural contexts. To examine how current model safeguards hold up in such settings, participants from the International Network for Advanced AI Measurement, Evaluation and Science, including representatives from Singapore, Japan, Australia, Canada, the EU, France, Kenya, South Korea and the UK, conducted a joint multilingual evaluation exercise. Led by Singapore AISI, two open-weight models were tested across ten languages spanning high- and low-resource groups: Cantonese, English, Farsi, French, Japanese, Korean, Kiswahili, Malay, Mandarin Chinese and Telugu. Over 6,000 newly translated prompts were evaluated across five harm categories (privacy, non-violent crime, violent crime, intellectual property and jailbreak robustness), using both…

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education · Academic integrity and plagiarism