Leveraging Wikidata for Geographically Informed Sociocultural Bias Dataset Creation: Application to Latin America
Yannis Karmim (ALMAnaCH), Renato Pino (UCHILE), Hernan Contreras (UCHILE), Hernan Lira, Sebastian Cifuentes (CENIA), Simon Escoffier (PUC), Luis Martí, Djamé Seddah (ALMAnaCH), Valentin Barrière (UCHILE, CENIA)

TL;DR
This paper introduces LatamQA, a dataset of over 26,000 culturally informed questions derived from Wikipedia and Wikidata, to evaluate and analyze biases of large language models concerning Latin American cultures.
Contribution
The paper presents a novel dataset and methodology leveraging Wikidata and expert knowledge to assess sociocultural biases in LLMs related to Latin America.
Findings
Models perform better in their original language.
Performance varies across Latin American countries.
Iberian Spanish culture is better represented than Latin American cultures.
Abstract
Large Language Models (LLMs) exhibit inequalities with respect to various cultural contexts. Most prominent open-weights models are trained on Global North data and show prejudicial behavior towards other cultures. Moreover, there is a notable lack of resources to detect biases in non-English languages, especially from Latin America (Latam), a continent containing various cultures, even though they share a common cultural ground. We propose to leverage the content of Wikipedia, the structure of the Wikidata knowledge graph, and expert knowledge from social science in order to create a dataset of question/answer (Q/A) pairs based on the different popular and social cultures of various Latin American countries. We create the LatamQA database of over 26k questions and associated answers extracted from 26k Wikipedia articles, and transformed into multiple-choice questions (MCQ) in Spanish…
Taxonomy
Topics: Complex Network Analysis Techniques · Topic Modeling · Expert finding and Q&A systems
