EsBBQ and CaBBQ: The Spanish and Catalan Bias Benchmarks for Question Answering

Valle Ruiz-Fernández; Mario Mina; Júlia Falcão; Luis Vasquez-Reina; Anna Sallés; Aitor Gonzalez-Agirre; Olatz Perez-de-Viñaspre

arXiv:2507.11216 · cs.CL · July 16, 2025

TL;DR

This paper introduces EsBBQ and CaBBQ, new social bias benchmarks for question answering in Spanish and Catalan, to evaluate biases in language models across different social contexts and languages.

Contribution

It provides the first parallel social bias datasets for Spanish and Catalan, adapting the BBQ benchmark to these languages and social settings, enabling bias assessment beyond English.

Findings

01 Models tend to fail to choose the correct answer in ambiguous scenarios.

02 High QA accuracy often correlates with greater reliance on social biases.

03 Bias evaluation results vary across model size and family.

Abstract

Previous literature has largely shown that Large Language Models (LLMs) perpetuate social biases learnt from their pre-training data. Given the notable lack of resources for social bias evaluation in languages other than English, and for social contexts outside of the United States, this paper introduces the Spanish and the Catalan Bias Benchmarks for Question Answering (EsBBQ and CaBBQ). Based on the original BBQ, these two parallel datasets are designed to assess social bias across 10 categories using a multiple-choice QA setting, now adapted to the Spanish and Catalan languages and to the social context of Spain. We report evaluation results on different LLMs, factoring in model family, size and variant. Our results show that models tend to fail to choose the correct answer in ambiguous scenarios, and that high QA accuracy often correlates with greater reliance on social biases.
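The split between ambiguous and other scenarios mentioned in the abstract can be illustrated with a minimal sketch that scores multiple-choice predictions separately per context type. This is not the authors' evaluation code: the field names (`context_type`, `label`, `prediction`), the answer labels, and the toy records are all hypothetical. It assumes, as in the original BBQ design, that the expected answer in an ambiguous context is the "unknown" option.

```python
from collections import defaultdict


def accuracy_by_context(records):
    """Return accuracy per context type for multiple-choice predictions.

    Each record is a dict with hypothetical keys:
      context_type: "ambiguous" or "disambiguated"
      label:        the expected answer option
      prediction:   the option the model chose
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        ctx = record["context_type"]
        total[ctx] += 1
        if record["prediction"] == record["label"]:
            correct[ctx] += 1
    return {ctx: correct[ctx] / total[ctx] for ctx in total}


# Toy, made-up records for illustration only (not data from the paper).
records = [
    # Ambiguous contexts: the expected answer is assumed to be "unknown".
    {"context_type": "ambiguous", "label": "unknown", "prediction": "target"},
    {"context_type": "ambiguous", "label": "unknown", "prediction": "unknown"},
    # Disambiguated contexts: the context supplies the answer.
    {"context_type": "disambiguated", "label": "target", "prediction": "target"},
    {"context_type": "disambiguated", "label": "non_target", "prediction": "non_target"},
]

scores = accuracy_by_context(records)
print(scores)
```

Reporting the two accuracies side by side is one simple way to see whether a model answers well when the context settles the question but guesses when it does not.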

Taxonomy

Topics: Interpreting and Communication in Healthcare · Topic Modeling · Natural Language Processing Techniques