The GaoYao Benchmark: A Comprehensive Framework for Evaluating Multilingual and Multicultural Abilities of Large Language Models

Yilun Liu; Chunguang Zhao; Mengyao Piao; Lingqi Miao; Shimin Tao; Minggui He; Chenxin Liu; Li Zhang; Hongxia Ma; Jiaxin Guo; Chen Liu; Liqun Deng; Jiansheng Wei; Xiaojun Meng; Fanyi Du; Daimeng Wei; Yanghua Xiao

arXiv:2604.20225·cs.CL·April 23, 2026

TL;DR

GaoYao is a comprehensive benchmark for evaluating the multilingual and multicultural abilities of large language models across 26 languages and 51 nations/areas. It addresses limitations of previous benchmarks through deeper cultural analysis and expert localization.

Contribution

The paper introduces GaoYao, a new benchmark with extensive cultural and linguistic coverage, a unified evaluation framework, and in-depth diagnostic analysis of LLMs' cultural competencies.

Findings

01 Significant geographical performance disparities in LLMs.

02 Gaps identified between different evaluation tasks.

03 Expert localization improves benchmark quality and coverage.

Abstract

Evaluating the multilingual and multicultural capabilities of Large Language Models (LLMs) is essential for their global utility. However, current benchmarks face three critical limitations: (1) fragmented evaluation dimensions that often neglect deep cultural nuances; (2) insufficient language coverage in subjective tasks relying on low-quality machine translation; and (3) shallow analysis that lacks diagnostic depth beyond simple rankings. To address these, we introduce GaoYao, a comprehensive benchmark with 182.3k samples, 26 languages and 51 nations/areas. First, GaoYao proposes a unified framework categorizing evaluation tasks into three cultural layers (General Multilingual, Cross-cultural, Monocultural) and nine cognitive sub-layers. Second, we achieve native-quality expansion by leveraging experts to rigorously localize subjective benchmarks into 19 languages and synthesizing…
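As a rough illustration of how results could be grouped under the three cultural layers named in the abstract, the sketch below averages scores per layer and language. Only the three layer names come from the abstract; the record shape, the function name `mean_score_by_layer`, the language labels, and the scores are hypothetical placeholders, not GaoYao's data format or results.

```python
from collections import defaultdict
from enum import Enum
from statistics import mean


class CulturalLayer(Enum):
    # The three layer names are taken from the abstract.
    GENERAL_MULTILINGUAL = "General Multilingual"
    CROSS_CULTURAL = "Cross-cultural"
    MONOCULTURAL = "Monocultural"


def mean_score_by_layer(results):
    """Average scores for each (layer, language) pair.

    `results` is an iterable of (layer, language, score) tuples. This record
    shape is an assumption for illustration, not the benchmark's own format.
    """
    buckets = defaultdict(list)
    for layer, language, score in results:
        buckets[(layer, language)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


# Made-up scores and placeholder language labels, only to exercise the function.
example_results = [
    (CulturalLayer.GENERAL_MULTILINGUAL, "lang_a", 0.80),
    (CulturalLayer.GENERAL_MULTILINGUAL, "lang_a", 0.90),
    (CulturalLayer.MONOCULTURAL, "lang_b", 0.50),
]
summary = mean_score_by_layer(example_results)
```

Keying the summary by both layer and language keeps per-language gaps visible instead of folding them into a single ranking number, which is the kind of diagnostic view the abstract argues for.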

Code & Models

Repositories

lunyiliu/GaoYao (GitHub)
