# Improving the Understandability of Clinical Guidelines: Development and Evaluation of a GPT-4–Based Pipeline

**Authors:** Matthew D Jones, Melissa Torgbi, Harish Tayyar Madabushi

PMC · DOI: 10.2196/81915 · Journal of Medical Internet Research · 2026-02-23

## TL;DR

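This study developed a GPT-4–based pipeline to improve the readability of clinical guidelines. It produced small readability gains similar to manual revision, but expert pharmacists found unintended omissions, additions, and changes in meaning, so revised text must be checked carefully.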

## Contribution

A GPT-4–based pipeline for revising clinical guidelines, evaluated for both readability and preservation of content, and compared with manual revision and user testing.

## Key findings

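- BERT scores indicated high semantic similarity between the original and LLM-revised versions of 20 guidelines (0.88-0.96).
- LLM revision produced a small but significant improvement in SMOG grade (mean difference 0.32), with no significant difference from manual revision; Flesch-Kincaid reading grades did not differ significantly.
- At least one expert pharmacist identified an omission, addition, or change in meaning in 20%, 5%, and 12% of the 153 LLM-revised guideline subsections, respectively.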
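
The two readability metrics named above weigh different text features, which is one way they can disagree on the same revision. The sketch below applies the standard published SMOG and Flesch-Kincaid grade formulas to two made-up sentences. The syllable counter is a crude vowel-group heuristic and the example sentences are hypothetical; neither is taken from the paper, whose tooling is not described in this summary.

```python
import math
import re


def count_syllables(word: str) -> int:
    """Rough vowel-group heuristic (an assumption; real tools use better rules)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing "e" as silent ("dose"), except in "-le" endings ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def split_text(text: str):
    """Return (sentences, words) using simple punctuation and letter matching."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    return sentences, words


def smog_grade(text: str) -> float:
    """SMOG grade: driven by words of 3+ syllables, normalised to 30 sentences."""
    sentences, words = split_text(text)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade: driven by sentence length and syllables per word."""
    sentences, words = split_text(text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


# Hypothetical guideline-style sentences, not quoted from the IMG.
original_text = "Administer the reconstituted solution intravenously over thirty minutes."
plain_text = "Give the mixed solution into a vein over thirty minutes."

for label, text in [("original", original_text), ("plain", plain_text)]:
    print(label, round(smog_grade(text), 2), round(flesch_kincaid_grade(text), 2))
```

SMOG responds only to the density of polysyllabic words, whereas Flesch-Kincaid also rewards shorter sentences, so a revision that swaps long words for several short ones can shift the two scores by different amounts.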

## Abstract

Difficulty in finding and understanding information in clinical guidelines contributes to medication errors. Large language models (LLMs) can simplify complex text to aid in understanding, but this approach to improving the quality of guidelines has not been investigated. However, LLMs are also known to hallucinate or generate outputs that may not align with reality.

This study aimed to develop and evaluate an LLM pipeline to improve the readability of clinical guidelines while ensuring the preservation of critical content.

To align LLM revisions with research evidence and enable comparison with manual editing, the National Health Service Injectable Medicines Guide (IMG) was used as a case study. A GPT-4–based pipeline was applied to IMG guidelines, with prompts based on user testing–derived recommendations for IMG authors. This enabled readability comparisons between IMG guideline versions: original, manually revised using the user testing–derived recommendations, GPT-4–revised using the same recommendations, and fully user tested. Readability was evaluated using readability metrics and ratings from 3 expert pharmacists. Content similarity before and after LLM revision was assessed using BERT (bidirectional encoder representations from transformers) scores and expert pharmacist review.
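
As a rough illustration of how an embedding-based similarity score behaves, the sketch below implements BERTScore-style greedy matching on toy vectors. It assumes that "BERT scores" refers to this kind of token-level cosine matching; the abstract does not name the model or implementation used, and the 3-dimensional "embeddings" here are hypothetical stand-ins for real contextual embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_score(reference, candidate):
    """Return (precision, recall, f1) from greedy token matching.

    Recall: each reference token is paired with its most similar candidate token.
    Precision: each candidate token is paired with its most similar reference token.
    """
    recall = sum(max(cosine(r, c) for c in candidate) for r in reference) / len(reference)
    precision = sum(max(cosine(c, r) for r in reference) for c in candidate) / len(candidate)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy embeddings (hypothetical): three original tokens, one dropped in the revision.
original_tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
revised_tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

precision, recall, f1 = greedy_match_score(original_tokens, revised_tokens)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case everything in the revision matches the original, so precision stays at 1, while the dropped token lowers recall. A single aggregate score can therefore remain high even when a short but clinically important phrase is missing, which is consistent with the study pairing BERT scores with expert pharmacist review.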

Considering 20 IMG guidelines used in practice, BERT scores indicated high semantic similarity between the original and LLM-revised guidelines (0.88-0.96). An omission, addition, or change in meaning was identified by at least one pharmacist in 30 (20%), 7 (5%), and 18 (12%) of the 153 guideline subsections, respectively. The SMOG (Simple Measure of Gobbledygook) grade showed a small but significant improvement in readability for the LLM-revised guidelines (mean difference 0.32, 95% CI 0.10-0.55; P=.02) and the manually revised versions (mean difference 0.46, 95% CI 0.13-0.79; P=.03). There was no significant difference between the LLM and manually revised versions (P>.99). There were no significant differences between Flesch-Kincaid reading grades (P=.91). Expert ratings favored the LLM-revised versions for understandability. Considering 2 IMG guidelines from previous research, user testing produced a greater improvement in readability than LLM revision.
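
The subsection percentages reported above can be reproduced from the raw counts. This is a small arithmetic check using only the figures in the abstract; the variable names are illustrative.

```python
# Counts of LLM-revised subsections flagged by at least one pharmacist (from the abstract).
total_subsections = 153
flagged = {"omission": 30, "addition": 7, "change in meaning": 18}

# Percentage of all subsections per issue type, rounded to whole numbers as reported.
percentages = {
    issue: round(100 * count / total_subsections)
    for issue, count in flagged.items()
}

for issue, pct in percentages.items():
    print(f"{issue}: {flagged[issue]}/{total_subsections} = {pct}%")
```

The abstract does not say whether one subsection can fall into more than one category, so the three counts should not be summed to estimate the number of affected subsections.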

Authors should not use current LLMs to modify clinical guidelines without carefully checking the revised text for unintended omissions, additions, or changes in meaning. Further work should investigate the potential of LLMs to augment manual user testing and reduce the barriers to the wider use of this approach to improve the safety of clinical guidelines.

## Full-text entities

- **Diseases:** hypertension (MESH:D006973), Flush (MESH:D005483), extravasation (MESH:D005119), hypotension (MESH:D007022)
- **Chemicals:** paracetamol (MESH:D000082), aminophylline (MESH:D000628), furosemide (MESH:D005665), phenytoin (MESH:D010672), vancomycin (MESH:D014640), amiodarone (MESH:D000638), amoxicillin (MESH:D000658), Voriconazole (MESH:D065819), levetiracetam (MESH:D000077287), propofol (MESH:D015742)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12928683/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12928683/full.md

## References

30 references — full list in the complete paper: https://tomesphere.com/paper/PMC12928683/full.md

---
Source: https://tomesphere.com/paper/PMC12928683