BertaQA: How Much Do Language Models Know About Local Culture?
Julen Etxaniz, Gorka Azkune, Aitor Soroa, Oier Lopez de Lacalle, Mikel Artetxe

TL;DR
This paper introduces BertaQA, a multiple-choice trivia dataset that is parallel in English and Basque. It shows that large language models know little about local Basque culture, and that continued pre-training in Basque improves this knowledge even when the model is queried in English, which is evidence of knowledge transfer from a low-resource to a high-resource language.
Contribution
The paper presents BertaQA, a novel bilingual dataset for evaluating cultural knowledge in LLMs, and provides evidence of knowledge transfer from low-resource to high-resource languages.
Findings
LLMs struggle with local cultural questions.
Continued pre-training in Basque improves performance on Basque culture questions, even when the model is queried in English.
Knowledge transfer from low-resource to high-resource languages is possible.
Abstract
Large Language Models (LLMs) exhibit extensive knowledge about the world, but most evaluations have been limited to global or anglocentric subjects. This raises the question of how well these models perform on topics relevant to other cultures, whose presence on the web is not that prominent. To address this gap, we introduce BertaQA, a multiple-choice trivia dataset that is parallel in English and Basque. The dataset consists of a local subset with questions pertinent to the Basque culture, and a global subset with questions of broader interest. We find that state-of-the-art LLMs struggle with local cultural knowledge, even as they excel on global topics. However, we show that continued pre-training in Basque significantly improves the models' performance on Basque culture, even when queried in English. To our knowledge, this is the first solid evidence of knowledge transfer from a…
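The local-versus-global comparison described in the abstract comes down to scoring multiple-choice predictions separately for each subset. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code. The record fields (`subset`, `answer`, `prediction`) and the toy records are assumptions made up for illustration; they are not the dataset's actual schema or any reported result.

```python
from collections import defaultdict

# Toy records for illustration only. The field names are hypothetical and
# the values are invented; they are not taken from BertaQA.
examples = [
    {"subset": "local", "answer": 1, "prediction": 0},
    {"subset": "local", "answer": 2, "prediction": 2},
    {"subset": "global", "answer": 0, "prediction": 0},
    {"subset": "global", "answer": 1, "prediction": 1},
]


def accuracy_by_subset(records):
    """Return the fraction of correct predictions for each subset."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        subset = record["subset"]
        total[subset] += 1
        # A prediction counts as correct when it matches the gold option index.
        correct[subset] += int(record["prediction"] == record["answer"])
    return {subset: correct[subset] / total[subset] for subset in total}


scores = accuracy_by_subset(examples)
```

Running the same function on predictions obtained with English prompts and with Basque prompts would give the per-language, per-subset breakdown that the abstract discusses.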
Taxonomy
Topics: Computational and Text Analysis Methods · Natural Language Processing Techniques · Digital Humanities and Scholarship
