INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge
Angelika Romanou, Negar Foroutan, Anna Sotnikova, Zeming Chen, Sree Harsha Nelaturu, Shivalika Singh, Rishabh Maheshwary, Micol Altomare, Mohamed A. Haggag, Snegha A, Alfonso Amayuelas, Azril Hafizi Amirudin, Viraat Aryabumi, Danylo Boiko, Michael Chang, Jenny Chim, Gal Cohen

TL;DR
This paper introduces INCLUDE, a benchmark of 197,243 QA pairs across 44 languages, built from local exam sources, that evaluates multilingual LLMs on regional and cultural knowledge and addresses the shortage of high-quality evaluation resources in languages other than English.
Contribution
It creates a new multilingual benchmark based on regional exam data, emphasizing local knowledge and reasoning, to better assess LLMs' performance in diverse language environments.
Findings
Benchmark covers 44 languages with 197,243 QA pairs.
Evaluates LLMs' regional knowledge and reasoning capabilities.
Highlights performance gaps in multilingual models across different languages.
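The per-language gaps mentioned above can be made concrete with a small sketch. It assumes each evaluated item is a record holding a language tag, a gold answer, and a model prediction; the field names (`language`, `answer`, `prediction`), the helper names, and the toy records are hypothetical and are not taken from the INCLUDE release or from the paper's evaluation code.

```python
from collections import defaultdict


def per_language_accuracy(records):
    """Return {language: accuracy} for multiple-choice QA records.

    Each record is a dict with hypothetical keys:
    'language', 'answer' (gold option), 'prediction' (model option).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        lang = record["language"]
        total[lang] += 1
        if record["prediction"] == record["answer"]:
            correct[lang] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


def accuracy_gap(accuracies):
    """Difference between the best and worst per-language accuracy."""
    return max(accuracies.values()) - min(accuracies.values())


# Toy records with placeholder language tags; not benchmark data.
toy_records = [
    {"language": "lang_a", "answer": "B", "prediction": "B"},
    {"language": "lang_a", "answer": "A", "prediction": "C"},
    {"language": "lang_b", "answer": "D", "prediction": "D"},
    {"language": "lang_b", "answer": "A", "prediction": "A"},
]

scores = per_language_accuracy(toy_records)
gap = accuracy_gap(scores)
```

Reporting accuracy per language, rather than one pooled score, keeps a large high-resource subset from masking weak performance on smaller ones.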
Abstract
The performance differential of large language models (LLMs) between languages hinders their effective deployment in many regions, inhibiting the potential economic and societal value of generative AI tools in many communities. However, the development of functional LLMs in many languages (i.e., multilingual LLMs) is bottlenecked by the lack of high-quality evaluation resources in languages other than English. Moreover, current practices in multilingual benchmark construction often translate English resources, ignoring the regional and cultural knowledge of the environments in which multilingual systems would be used. In this work, we construct an evaluation suite of 197,243 QA pairs from local exam sources to measure the capabilities of multilingual LLMs in a variety of regional contexts. Our novel resource, INCLUDE, is a comprehensive knowledge- and reasoning-centric benchmark across 44…
Code & Models
- 🤗 google/gemma-3n-E2B-it-litert-lm · model · 5.7k dl · ♡ 386
- 🤗 google/gemma-3n-E4B-it-litert-lm · model · 4.9k dl · ♡ 384
- 🤗 google/gemma-3n-E2B-it · model · 272k dl · ♡ 290
- 🤗 google/gemma-3n-E4B-it-litert-preview · model · ♡ 1479
- 🤗 google/gemma-3n-E4B · model · 3.8k dl · ♡ 136
- 🤗 google/gemma-3n-E4B-it · model · 50k dl · ♡ 890
- 🤗 unsloth/gemma-3n-E2B-it-GGUF · model · 19k dl · ♡ 60
- 🤗 MuXodious/gemma-3n-E4B-it-PaperWitch-heresy · model · 107 dl · ♡ 2
- 🤗 google/gemma-3n-E2B-it-litert-preview · model · ♡ 577
- 🤗 google/gemma-3n-E2B · model · 3.6k dl · ♡ 91
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
