CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks

Peiqin Lin; Chenyang Lyu; Wenjiang Luo; Haotian Ye; Md Mehrab Hossain; Chunlan Ma; Shaoxiong Ji; Younes Samih; Bo Zeng; Fan Jiang; Yuanbin Cao; Dilda Duisenbek; Adrian Neo Sau Xun; Daria Pozdniakova; Liubou Misevich; Nevena Marinković; Ngoc Gia Linh Nguyen; Thi Khanh Linh Do; Sarakmatak Sophy; Baotian Hu; Guanhua Chen; Gongbo Tang; Alham Fikri Aji; Longyue Wang; and Weihua Luo

arXiv:2604.19262·cs.CL·April 22, 2026

TL;DR

CulturALL is a new benchmark designed to evaluate multilingual and multicultural grounded reasoning abilities of LLMs across diverse, real-world scenarios, revealing significant performance gaps.

Contribution

The paper introduces a comprehensive benchmark, built through human-AI collaboration, of 2,610 challenging grounded-task samples spanning 14 languages and 16 topics.

Findings

01. The best-performing LLM achieves only 44.48% accuracy on CulturALL.

02. CulturALL covers diverse languages, regions, and topics.

03. The benchmark reveals substantial room for improvement in LLM grounded reasoning.

Abstract

Large language models (LLMs) are now deployed worldwide, inspiring a surge of benchmarks that measure their multilingual and multicultural abilities. However, these benchmarks prioritize generic language understanding or superficial cultural trivia, leaving the evaluation of grounded tasks, where models must reason within real-world, context-rich scenarios, largely unaddressed. To fill this gap, we present CulturALL, a comprehensive and challenging benchmark to assess LLMs' multilingual and multicultural competence on grounded tasks. CulturALL is built via a human-AI collaborative framework: expert annotators ensure appropriate difficulty and factual accuracy, while LLMs lighten the manual workload. By incorporating diverse sources, CulturALL ensures comprehensive scenario coverage. Each item is carefully designed to present a high level of difficulty, making CulturALL challenging.…
