HealthQA-BR: A System-Wide Benchmark Reveals Critical Knowledge Gaps in Large Language Models

Andrew Maranhão Ventura D'addario

arXiv:2506.21578·cs.CL·June 30, 2025

TL;DR

This paper introduces HealthQA-BR, a Portuguese-language healthcare benchmark of 5,632 questions that reveals significant knowledge gaps in large language models across health professions, showing that evaluation must go beyond overall accuracy.

Contribution

It presents the first large-scale, system-wide healthcare benchmark for Portuguese, assessing multiple health disciplines and exposing systemic knowledge deficiencies in leading LLMs.

Findings

01

GPT 4.1 achieves 86.6% accuracy overall

02

Performance varies greatly across specialties, from 98.7% in Ophthalmology to 60.0% in Neurosurgery

03

Models show systemic 'spiky' knowledge profiles, highlighting safety concerns

Abstract

The evaluation of Large Language Models (LLMs) in healthcare has been dominated by physician-centric, English-language benchmarks, creating a dangerous illusion of competence that ignores the interprofessional nature of patient care. To provide a more holistic and realistic assessment, we introduce HealthQA-BR, the first large-scale, system-wide benchmark for Portuguese-speaking healthcare. Comprising 5,632 questions from Brazil's national licensing and residency exams, it uniquely assesses knowledge not only in medicine and its specialties but also in nursing, dentistry, psychology, social work, and other allied health professions. We conducted a rigorous zero-shot evaluation of over 20 leading LLMs. Our results reveal that while state-of-the-art models like GPT 4.1 achieve high overall accuracy (86.6%), this top-line score masks alarming, previously unmeasured deficiencies. A granular…
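The contrast between a top-line score and per-specialty results can be illustrated with a grouped accuracy computation. The following is a minimal sketch, not the paper's evaluation code: the function name `accuracy_by_group`, the `(group, is_correct)` record format, and the toy records are all assumptions made for illustration, and the numbers are invented rather than taken from the benchmark.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return (overall_accuracy, {group: accuracy}).

    `records` is an iterable of (group, is_correct) pairs. This record
    format is a hypothetical one chosen for the example.
    """
    totals = defaultdict(int)  # questions seen per group
    hits = defaultdict(int)    # correct answers per group
    for group, is_correct in records:
        totals[group] += 1
        hits[group] += int(is_correct)
    if not totals:
        raise ValueError("no records to score")
    overall = sum(hits.values()) / sum(totals.values())
    per_group = {group: hits[group] / totals[group] for group in totals}
    return overall, per_group


# Toy, made-up results (NOT data from the paper): 10 questions in one
# specialty and 5 in another.
toy_records = (
    [("Ophthalmology", True)] * 9
    + [("Ophthalmology", False)] * 1
    + [("Neurosurgery", True)] * 3
    + [("Neurosurgery", False)] * 2
)

overall, per_group = accuracy_by_group(toy_records)
print(f"overall: {overall:.1%}")
for group, acc in sorted(per_group.items(), key=lambda item: item[1]):
    print(f"{group}: {acc:.1%}")
```

In this toy data the pooled accuracy sits between the two specialty scores and is pulled toward the larger group, which is how a single aggregate figure can hide a weak specialty.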

Code & Models

Datasets

Larxel/healthqa-br
dataset · 665 dl

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Electronic Health Records Systems · Machine Learning in Healthcare