HIVMedQA: Benchmarking large language models for HIV medical decision support

Gonzalo Cardenal-Antolin; Jacques Fellay; Bashkim Jaha; Roger Kouyos; Niko Beerenwinkel; Diane Duroux

arXiv:2507.18143·cs.CL·July 28, 2025

Open Access

TL;DR

This paper introduces HIVMedQA, a benchmark for evaluating large language models in HIV medical decision support, and uses it to show the models' strengths, their limitations, and the challenges that remain for clinical use.

Contribution

It presents HIVMedQA, a novel benchmark dataset for assessing LLMs in HIV care, and provides a comprehensive evaluation of various models' performance and limitations.

Findings

01. Gemini 2.5 Pro outperformed other models.

02. Performance drops with increased question complexity.

03. Larger models do not always perform better.

Abstract

Large language models (LLMs) are emerging as valuable tools to support clinicians in routine decision-making. HIV management is a compelling use case due to its complexity, including diverse treatment options, comorbidities, and adherence challenges. However, integrating LLMs into clinical practice raises concerns about accuracy, potential harm, and clinician acceptance. Despite their promise, AI applications in HIV care remain underexplored, and LLM benchmarking studies are scarce. This study evaluates the current capabilities of LLMs in HIV management, highlighting their strengths and limitations. We introduce HIVMedQA, a benchmark designed to assess open-ended medical question answering in HIV care. The dataset consists of curated, clinically relevant questions developed with input from an infectious disease physician. We evaluated seven general-purpose and three medically…

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · HIV/AIDS Research and Interventions