HealthQA-BR: A System-Wide Benchmark Reveals Critical Knowledge Gaps in Large Language Models
Andrew Maranhão Ventura D'addario

TL;DR
This paper introduces HealthQA-BR, a comprehensive Portuguese healthcare benchmark revealing significant knowledge gaps in large language models across various health professions, emphasizing the need for detailed evaluation beyond overall accuracy.
Contribution
It presents the first large-scale, system-wide healthcare benchmark for Portuguese, assessing multiple health disciplines and exposing systemic knowledge deficiencies in leading LLMs.
Findings
GPT 4.1 achieves 86.6% accuracy overall
Performance varies greatly across specialties, from 98.7% in Ophthalmology to 60.0% in Neurosurgery
Models show systemic 'spiky' knowledge profiles, highlighting safety concerns
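The gap between the headline score and the per-specialty scores can be illustrated with a minimal sketch of a granular breakdown. This is not the paper's evaluation code: the function names, the record layout of `(group, is_correct)` pairs, and the toy data are all assumptions made for illustration, and the toy numbers are not results from HealthQA-BR.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {group: accuracy} for an iterable of (group, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        hits[group] += int(is_correct)
    return {group: hits[group] / totals[group] for group in totals}


def overall_accuracy(records):
    """Return the pooled accuracy over all records, ignoring groups."""
    records = list(records)
    return sum(int(is_correct) for _, is_correct in records) / len(records)


# Hypothetical toy data (NOT from the benchmark): two made-up groups.
toy_records = (
    [("specialty_a", True)] * 9 + [("specialty_a", False)] * 1
    + [("specialty_b", True)] * 3 + [("specialty_b", False)] * 2
)

per_group = accuracy_by_group(toy_records)
overall = overall_accuracy(toy_records)

# The spread between the best and worst group is one simple way to
# quantify how "spiky" a knowledge profile is.
spread = max(per_group.values()) - min(per_group.values())

for group, acc in sorted(per_group.items()):
    print(f"{group}: {acc:.1%}")
print(f"overall: {overall:.1%}, spread: {spread:.1%}")
```

In this toy case the pooled accuracy sits well above the weaker group's accuracy, which is the same kind of masking the findings describe for the top-line score.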
Abstract
The evaluation of Large Language Models (LLMs) in healthcare has been dominated by physician-centric, English-language benchmarks, creating a dangerous illusion of competence that ignores the interprofessional nature of patient care. To provide a more holistic and realistic assessment, we introduce HealthQA-BR, the first large-scale, system-wide benchmark for Portuguese-speaking healthcare. Comprising 5,632 questions from Brazil's national licensing and residency exams, it uniquely assesses knowledge not only in medicine and its specialties but also in nursing, dentistry, psychology, social work, and other allied health professions. We conducted a rigorous zero-shot evaluation of over 20 leading LLMs. Our results reveal that while state-of-the-art models like GPT 4.1 achieve high overall accuracy (86.6%), this top-line score masks alarming, previously unmeasured deficiencies. A granular…
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Electronic Health Records Systems · Machine Learning in Healthcare
