WHBench: Evaluating Frontier LLMs with Expert-in-the-Loop Validation on Women's Health Topics
Sneha Maurya, Pragya Saboo, and Girish Kumar

TL;DR
WHBench is a specialized benchmark of 47 expert-crafted scenarios across 10 women's health topics. It reveals significant safety, accuracy, and equity challenges in current large language models used for medical guidance.
Contribution
The paper introduces WHBench, a new targeted evaluation suite for assessing large language models on women's health, emphasizing failure modes and expert-in-the-loop validation.
Findings
No model exceeds 75% mean performance on WHBench.
The top model reaches a mean score of 72.1% on the 23-criterion rubric, leaving substantial room for improvement.
Even top models show low fully correct rates and notable variation in harm rates.
Abstract
Large language models are increasingly used for medical guidance, but women's health remains under-evaluated in benchmark design. We present the Women's Health Benchmark (WHBench), a targeted evaluation suite of 47 expert-crafted scenarios across 10 women's health topics, designed to expose clinically meaningful failure modes including outdated guidelines, unsafe omissions, dosing errors, and equity-related blind spots. We evaluate 22 models using a 23-criterion rubric spanning clinical accuracy, completeness, safety, communication quality, instruction following, equity, uncertainty handling, and guideline adherence, with safety-weighted penalties and server-side score recalculation. Across 3,102 attempted responses (3,100 scored), no model mean performance exceeds 75 percent; the best model reaches 72.1 percent. Even top models show low fully correct rates and substantial variation in…
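The abstract mentions safety-weighted penalties and server-side score recalculation but does not give the formula. As a minimal sketch of what such a scheme could look like, the Python below recomputes a response's score from its per-criterion grades instead of trusting a total reported by the grader. Every name and number in it (`CriterionScore`, `recalculate_score`, `SAFETY_WEIGHT`, `HARM_PENALTY`) is a hypothetical assumption for illustration, not the paper's actual rubric or weights.

```python
from dataclasses import dataclass

# Hypothetical values: the source does not state weights or penalty sizes.
SAFETY_WEIGHT = 2.0    # assumed extra weight for safety criteria
DEFAULT_WEIGHT = 1.0   # assumed weight for all other criteria
HARM_PENALTY = 0.25    # assumed deduction when a response is flagged harmful


@dataclass(frozen=True)
class CriterionScore:
    name: str
    category: str  # e.g. "safety", "clinical_accuracy", "equity"
    score: float   # grader's score for this criterion, in [0, 1]


def recalculate_score(criteria, harmful=False):
    """Recompute a percentage score from per-criterion grades.

    The total is derived only from the individual criterion scores,
    so a total reported by the grader is never trusted directly.
    """
    if not criteria:
        raise ValueError("at least one criterion is required")

    total_weight = 0.0
    weighted_sum = 0.0
    for c in criteria:
        if not 0.0 <= c.score <= 1.0:
            raise ValueError(f"score out of range for {c.name}: {c.score}")
        weight = SAFETY_WEIGHT if c.category == "safety" else DEFAULT_WEIGHT
        total_weight += weight
        weighted_sum += weight * c.score

    fraction = weighted_sum / total_weight
    if harmful:
        # Safety penalty: subtract a fixed amount, never dropping below zero.
        fraction = max(0.0, fraction - HARM_PENALTY)
    return round(100.0 * fraction, 1)


example = [
    CriterionScore("flags_contraindication", "safety", 0.5),
    CriterionScore("correct_dose", "clinical_accuracy", 1.0),
    CriterionScore("avoids_biased_assumptions", "equity", 1.0),
]
print(recalculate_score(example))                # 75.0
print(recalculate_score(example, harmful=True))  # 50.0
```

In this sketch, the half-credit safety criterion pulls the score down more than a half-credit grade on any other criterion would, which is the general effect a safety-weighted rubric aims for.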
