Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information

Joshua Harris; Fan Grayson; Felix Feldman; Timothy Laurence; Toby Nonnenmacher; Oliver Higgins; Leo Loman; Selina Patel; Thomas Finnie; Samuel Collins; Michael Borowitz

arXiv:2505.06046·cs.CL·March 10, 2026

TL;DR

This paper introduces PubHealthBench, a new benchmark for evaluating LLMs' knowledge of UK public health information. Models reach high accuracy on multiple choice questions but perform worse on free form responses, which highlights the need for safeguards.

Contribution

The paper presents PubHealthBench, a comprehensive benchmark with over 8000 questions derived from UK government health documents, to assess LLMs' public health knowledge.

Findings

01 · Proprietary LLMs achieve >90% accuracy in MCQA.

02 · LLMs perform below 75% in free form responses.

03 · State-of-the-art LLMs are increasingly accurate but need safeguards.

Abstract

As Large Language Models (LLMs) become widely accessible, a detailed understanding of their knowledge within specific domains becomes necessary for successful real world use. This is particularly critical in the domains of medicine and public health, where failure to retrieve relevant, accurate, and current information could significantly impact UK residents. However, while there are a number of LLM benchmarks in the medical domain, currently little is known about LLM knowledge within the field of public health. To address this issue, this paper introduces a new benchmark, PubHealthBench, with over 8000 questions for evaluating LLMs' Multiple Choice Question Answering (MCQA) and free form responses to public health queries. To create PubHealthBench we extract free text from 687 current UK government guidance documents and implement an automated pipeline for generating MCQA samples.…
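To make the MCQA accuracy figures concrete, here is a minimal sketch of how a multiple choice benchmark of this kind could be scored. It is not the authors' evaluation code: the `MCQASample` structure, the prompt wording, the answer-parsing rule, and the number of options per question are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass

LETTERS = "ABCDEFG"  # assumed upper bound on options per question


@dataclass
class MCQASample:
    """Hypothetical container for one multiple choice question."""
    question: str
    options: list[str]
    answer_index: int  # index of the correct option


def format_prompt(sample: MCQASample) -> str:
    """Render a question and its lettered options as a plain-text prompt."""
    lines = [sample.question]
    lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(sample.options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def parse_letter(response: str, n_options: int):
    """Return the index of the first standalone option letter, or None.

    Deliberately simple: a stray article such as "a" would be read as
    option A, so real evaluations usually constrain the output format.
    """
    valid = LETTERS[:n_options]
    match = re.search(rf"\b([{valid}])\b", response.strip().upper())
    return valid.index(match.group(1)) if match else None


def accuracy(samples: list[MCQASample], responses: list[str]) -> float:
    """Fraction of responses whose parsed letter matches the answer key.

    Unparseable responses count as incorrect.
    """
    if len(samples) != len(responses):
        raise ValueError("samples and responses must have the same length")
    correct = sum(
        parse_letter(resp, len(s.options)) == s.answer_index
        for s, resp in zip(samples, responses)
    )
    return correct / len(samples)
```

Free form responses cannot be scored this way, since there is no letter to match against an answer key; the reviews below note that the paper relies on an LLM-as-a-judge for that setting.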

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1) An original benchmark on public health (here, UK).
2) Both MCQA and open questions.
3) A large evaluation using state-of-the-art LLMs. Also, the authors made the effort to choose an open model (OLMo-2).

Weaknesses

1) Only part of the benchmark has been manually checked (10% of the benchmark).
2) The generation of benchmark questions relies quite heavily on the use of LLMs. However, manually verifying part of the benchmark helps to counterbalance this point.
3) Why not also evaluate LLMs adapted to the medical field?
4) Even though we are aware of their limitations, it would have been interesting to have metrics complementary to the LLM-as-a-judge for open-ended questions. A human e

Reviewer 02 · Rating 4 · Confidence 4

Strengths

* First comprehensive QA benchmark in the public health domain with a focus on UK health guidance (the methodology can generalize to other countries based on the framework), with over 8090 multiple choice QA questions from 687 documents.
* Automated, scalable pipeline that can be updated as guidance changes, with human expert review of 10% of questions.
* Evaluation of 24 LLMs covering both proprietary and open-weight LLMs on MCQA and free-form responses (based on questions from MCQA).
* Benchmark with hu

Weaknesses

* Mismatch between the motivation and the results: one of the biggest claims in the introduction is that a public health QA benchmark is necessary because there is a risk of hallucinations or incomplete information. However, the results on the benchmark are quite stellar (and already exceed the human baseline). As such, the MCQA results seem to undermine the original motivation and need for the benchmark. Instead, the free-form response setting seems to be underexplored given the performance differences.
* There a

Reviewer 03 · Rating 2 · Confidence 4

Strengths

- Empathy and ethical alignment are generally missing from traditional benchmarks like MedQA/PubMedQA, so the motivation is timely.
- A human baseline is used for comparison, which helps to gauge the expectation in real life.

Weaknesses

- Most of the datasets are somewhat rehashing existing ones. The authors claim to provide new data, but the generation is somewhat contrived, for example varying the gender/age/race in a controlled medical question. This lacks theoretical grounding and formalism, and there is no sensitivity analysis or ablation study. In addition, the data distribution is unknown: there is no quantitative description of dataset size, balance, or topic diversity (e.g., how many cardiology vs. psychiatry cases?). And the ground

Code & Models

Datasets

Joshua-Harris/PubHealthBench
dataset · 28 dl

Taxonomy

Topics: Health Literacy and Information Accessibility · Topic Modeling · Computational and Text Analysis Methods

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Dropout · Layer Normalization · Byte Pair Encoding · Softmax · Absolute Position Encodings · Residual Connection