INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge

Angelika Romanou; Negar Foroutan; Anna Sotnikova; Zeming Chen; Sree Harsha Nelaturu; Shivalika Singh; Rishabh Maheshwary; Micol Altomare; Mohamed A. Haggag; Snegha A; Alfonso Amayuelas; Azril Hafizi Amirudin; Viraat Aryabumi; Danylo Boiko; Michael Chang; Jenny Chim; Gal Cohen; Aditya Kumar Dalmia; Abraham Diress; Sharad Duwal; Daniil Dzenhaliou; Daniel Fernando Erazo Florez; Fabian Farestam; Joseph Marvin Imperial; Shayekh Bin Islam; Perttu Isotalo; Maral Jabbarishiviari; Börje F. Karlsson; Eldar Khalilov; Christopher Klamm; Fajri Koto; Dominik Krzemiński; Gabriel Adriano de Melo; Syrielle Montariol; Yiyang Nan; Joel Niklaus; Jekaterina Novikova; Johan Samir Obando Ceron; Debjit Paul; Esther Ploeger; Jebish Purbey; Swati Rajwal; Selvan Sunitha Ravi; Sara Rydell; Roshan Santhosh; Drishti Sharma; Marjana Prifti Skenduli; Arshia Soltani Moakhar; Bardia Soltani Moakhar; Ran Tamir; Ayush Kumar Tarun; Azmine Toushik Wasi; Thenuka Ovin Weerasinghe; Serhan Yilmaz; Mike Zhang; Imanol Schlag; Marzieh Fadaee; Sara Hooker; Antoine Bosselut

arXiv:2411.19799·cs.CL·December 2, 2024·3 cites

TL;DR

This paper introduces INCLUDE, a large benchmark focused on regional knowledge, with 197,243 QA pairs across 44 languages. It evaluates multilingual LLMs in real-world, culturally relevant contexts and addresses the shortage of high-quality evaluation resources in languages other than English.

Contribution

The paper builds a new multilingual benchmark from local exam sources, emphasizing regional knowledge and reasoning, to better assess LLM performance in diverse language environments.

Findings

01. The benchmark covers 44 languages with 197,243 QA pairs.

02. It evaluates LLMs' regional knowledge and reasoning capabilities.

03. It highlights performance gaps in multilingual models across different languages.

Abstract

The performance differential of large language models (LLMs) between languages hinders their effective deployment in many regions, inhibiting the potential economic and societal value of generative AI tools in many communities. However, the development of functional LLMs in many languages (i.e., multilingual LLMs) is bottlenecked by the lack of high-quality evaluation resources in languages other than English. Moreover, current practices in multilingual benchmark construction often translate English resources, ignoring the regional and cultural knowledge of the environments in which multilingual systems would be used. In this work, we construct an evaluation suite of 197,243 QA pairs from local exam sources to measure the capabilities of multilingual LLMs in a variety of regional contexts. Our novel resource, INCLUDE, is a comprehensive knowledge- and reasoning-centric benchmark across 44…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification