HEAD-QA v2: Expanding a Healthcare Benchmark for Reasoning

Alexis Correa-Guillén; Carlos Gómez-Rodríguez; David Vilares

arXiv:2511.15355 · cs.CL · March 31, 2026

TL;DR

HEAD-QA v2 is a comprehensive healthcare reasoning dataset in Spanish and English, designed to evaluate and improve large language models' biomedical reasoning capabilities.

Contribution

The paper expands the original HEAD-QA dataset to over 12,000 questions drawn from ten years of Spanish professional exams, benchmarks several open-source LLMs, and provides additional multilingual versions to support future biomedical reasoning research.

Findings

01. Model performance mainly depends on scale and reasoning ability.

02. Complex inference strategies yield limited improvements.

03. HEAD-QA v2 is a reliable resource for biomedical reasoning research.

Abstract

We introduce HEAD-QA v2, an expanded and updated version of a Spanish/English healthcare multiple-choice reasoning dataset originally released by Vilares and Gómez-Rodríguez (2019). The update responds to the growing need for high-quality datasets that capture the linguistic and conceptual complexity of healthcare reasoning. We extend the dataset to over 12,000 questions from ten years of Spanish professional exams, benchmark several open-source LLMs using prompting, RAG, and probability-based answer selection, and provide additional multilingual versions to support future work. Results indicate that performance is mainly driven by model scale and intrinsic reasoning ability, with complex inference strategies obtaining limited gains. Together, these results establish HEAD-QA v2 as a reliable resource for advancing research on biomedical reasoning and model improvement.
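The abstract mentions probability-based answer selection without giving details here. As a rough illustration only, the sketch below shows one common form of the idea: each answer option receives a log-probability from a model, and the highest-scoring option is chosen. The function names (`select_answer`, `option_probabilities`, `accuracy`) and the example scores are hypothetical and not taken from the paper; how the scores are obtained from an actual LLM is left out.

```python
import math
from typing import List, Sequence


def select_answer(option_logprobs: Sequence[float]) -> int:
    """Return the 1-based index of the option with the highest log-probability.

    Assumption: one score per answer option, produced by some model.
    Ties resolve to the earliest option.
    """
    if not option_logprobs:
        raise ValueError("at least one option is required")
    best = max(range(len(option_logprobs)), key=lambda i: option_logprobs[i])
    return best + 1


def option_probabilities(option_logprobs: Sequence[float]) -> List[float]:
    """Softmax over the options, so the scores sum to 1 within a question."""
    peak = max(option_logprobs)  # subtract the max for numerical stability
    exps = [math.exp(lp - peak) for lp in option_logprobs]
    total = sum(exps)
    return [e / total for e in exps]


def accuracy(predictions: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of questions where the predicted option matches the gold one."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy over zero questions")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Made-up scores for a single four-option question (illustrative only).
example_scores = [-2.3, -0.4, -1.7, -3.0]
predicted = select_answer(example_scores)
```

Selecting by raw log-probability and selecting by the softmax-normalized value pick the same option; the normalization only matters if the scores are to be read as a distribution over the options.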

Code & Models

Datasets

alesi12/head_qa_v2
dataset · 318 downloads
