A Real-World Evaluation of LLM Medication Safety Reviews in NHS Primary Care
Oliver Normand, Esther Borsi, Mitch Fruin, Lauren E Walker, Jamie Heagerty, Chris C. Holmes, Anthony J Avery, Iain E Buchan, Harry Coppock

TL;DR
This study evaluates an LLM-based medication safety review system on real NHS primary care data. The system shows high sensitivity but notable failure modes in contextual reasoning, which highlights the challenges of safe clinical deployment.
Contribution
First real-world evaluation (to the authors' knowledge) of an LLM system for medication safety in NHS primary care, with detailed failure analysis and insights into limitations.
Findings
High sensitivity (100%) in detecting clinical issues.
Specificity of 83.1%, but all issues were correctly identified in only 46.9% of cases.
Failure patterns include overconfidence, guideline misapplication, and misunderstanding healthcare delivery.
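For readers less familiar with the headline metrics, the sketch below shows how sensitivity and specificity are conventionally computed from confusion-matrix counts. The function names and the counts are hypothetical and chosen only for illustration; they are not the study's data, and this is not the authors' evaluation code.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of cases with a real issue that the system flagged."""
    total = true_pos + false_neg
    if total == 0:
        raise ValueError("no positive cases to evaluate")
    return true_pos / total


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of cases with no issue that the system left unflagged."""
    total = true_neg + false_pos
    if total == 0:
        raise ValueError("no negative cases to evaluate")
    return true_neg / total


# Hypothetical counts for illustration only (NOT taken from the paper).
tp, fn, tn, fp = 90, 0, 8, 2

print(f"sensitivity: {sensitivity(tp, fn):.1%}")  # 100.0%
print(f"specificity: {specificity(tn, fp):.1%}")  # 80.0%
```

Both metrics only ask whether any issue was flagged. Requiring every issue to be identified correctly is a stricter criterion, which is why a system can reach 100% sensitivity and still score much lower on that measure.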
Abstract
Large language models (LLMs) often match or exceed clinician-level performance on medical benchmarks, yet very few are evaluated on real clinical data or examined beyond headline metrics. We present, to our knowledge, the first evaluation of an LLM-based medication safety review system on real NHS primary care data, with detailed characterisation of key failure behaviours across varying levels of clinical complexity. In a retrospective study using a population-scale EHR spanning 2,125,549 adults in NHS Cheshire and Merseyside, we strategically sampled patients to capture a broad range of clinical complexity and medication safety risk, yielding 277 patients after data-quality exclusions. An expert clinician reviewed these patients and graded system-identified issues and proposed interventions. Our primary LLM system showed strong performance in recognising when a clinical issue is…
Taxonomy
Topics: Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education · Electronic Health Records Systems
