A Women's Health Benchmark for Large Language Models

Victoria-Elisabeth Gruber; Razvan Marinescu; Diego Fajardo; Amin H. Nassar; Christopher Arkfeld; Alexandria Ludlow; Shama Patel; Mehrnoosh Samaei; Valerie Klug; Anna Huber; Marcel Gühner; Albert Botta i Orfila; Irene Lagoja; Kimya Tarr; Haleigh Larson; Mary Beth Howard

arXiv:2512.17028·cs.CL·December 22, 2025

Open Access

TL;DR

This paper introduces the Women's Health Benchmark (WHB), a comprehensive evaluation tool for assessing large language models' accuracy and reliability in providing women's health information across multiple specialties and query types.

Contribution

The paper presents the first dedicated benchmark for women's health in LLMs, covering diverse specialties, query types, and error categories, and evaluates state-of-the-art models on this benchmark.

Findings

01. The evaluated models show a failure rate of about 60% on women's health tasks.

02. Performance varies significantly across specialties and error types.

03. Newer models such as GPT-5 show notable improvements in avoiding inappropriate recommendations.

Abstract

As large language models (LLMs) become primary sources of health information for millions, their accuracy in women's health remains critically unexamined. We introduce the Women's Health Benchmark (WHB), the first benchmark evaluating LLM performance specifically in women's health. Our benchmark comprises 96 rigorously validated model stumps covering five medical specialties (obstetrics and gynecology, emergency medicine, primary care, oncology, and neurology), three query types (patient query, clinician query, and evidence/policy query), and eight error types (dosage/medication errors, missing critical information, outdated guidelines/treatment recommendations, incorrect treatment advice, incorrect factual information, missing/incorrect differential diagnosis, missed urgency, and inappropriate recommendations). We evaluated 13 state-of-the-art LLMs and revealed alarming gaps: current…
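The taxonomy named in the abstract (five specialties, three query types, eight error types) can be written down as plain data, which makes it easier to see how a per-category failure rate would be computed. The following is a minimal sketch, not the authors' evaluation code: the `GradedItem` record, the `failure_rate` and `failure_rate_by` helpers, and the toy data are all hypothetical and exist only for illustration. Only the category labels come from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass

# Category labels copied from the abstract.
SPECIALTIES = (
    "obstetrics and gynecology",
    "emergency medicine",
    "primary care",
    "oncology",
    "neurology",
)
QUERY_TYPES = ("patient query", "clinician query", "evidence/policy query")
ERROR_TYPES = (
    "dosage/medication errors",
    "missing critical information",
    "outdated guidelines/treatment recommendations",
    "incorrect treatment advice",
    "incorrect factual information",
    "missing/incorrect differential diagnosis",
    "missed urgency",
    "inappropriate recommendations",
)


@dataclass(frozen=True)
class GradedItem:
    """Hypothetical record: one benchmark item graded for one model."""

    specialty: str
    query_type: str
    error_type: str
    failed: bool

    def __post_init__(self):
        # Reject labels that are not part of the taxonomy above.
        if self.specialty not in SPECIALTIES:
            raise ValueError(f"unknown specialty: {self.specialty}")
        if self.query_type not in QUERY_TYPES:
            raise ValueError(f"unknown query type: {self.query_type}")
        if self.error_type not in ERROR_TYPES:
            raise ValueError(f"unknown error type: {self.error_type}")


def failure_rate(items):
    """Fraction of graded items marked as failed."""
    items = list(items)
    if not items:
        raise ValueError("no graded items")
    return sum(item.failed for item in items) / len(items)


def failure_rate_by(items, field):
    """Failure rate grouped by one taxonomy field, e.g. 'specialty'."""
    groups = defaultdict(list)
    for item in items:
        groups[getattr(item, field)].append(item)
    return {key: failure_rate(group) for key, group in groups.items()}


# Made-up toy data, NOT results from the paper.
toy = [
    GradedItem("oncology", "patient query", "missed urgency", True),
    GradedItem("oncology", "clinician query", "missed urgency", False),
    GradedItem("neurology", "patient query", "inappropriate recommendations", False),
    GradedItem("primary care", "patient query", "missed urgency", True),
]

overall = failure_rate(toy)
by_specialty = failure_rate_by(toy, "specialty")
by_error_type = failure_rate_by(toy, "error_type")
```

Grouping by `"specialty"` or `"error_type"` is one way a breakdown like the one summarized in the findings could be produced. How the paper actually grades a response as a failure is not stated in the text shown here.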

Taxonomy

Topics: Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education · Topic Modeling