# Adherence of Free-Tier Large Language Models to the 2024 European Society of Cardiology (ESC) Guidelines for the Management of Elevated Blood Pressure and Hypertension: A Comparative Study

**Authors:** Aleksander Polus, Dawid Boczkowski, Rania Suleiman, Bartosz Palacz, Natalia Marianna Kubis, Julia Anna Wrona, Wiktor Perz, Maria Magdalena Teper, Anhelina Korolchuk, Jedrzej Piotrowski, Anna Gluzicka, Anna Matyas, Aleksander Tuteja, Piotr Sawina, Aleksandra Wielochowska

PMC · DOI: 10.7759/cureus.104111 · 2026-02-23

## TL;DR

This study compares how closely three free-tier large language models follow the 2024 European Society of Cardiology (ESC) guidelines for managing elevated blood pressure and hypertension.

## Contribution

First comparative analysis of free-tier LLMs' adherence to the 2024 ESC hypertension guidelines using physician-verified questions.

## Key findings

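- All three LLMs answered 80.0% to 82.5% of the 40 questions correctly, with no statistically significant difference in guideline adherence (p>0.99).
- Claude 4.5 Sonnet had the highest numerical accuracy at 82.5% (33 of 40); ChatGPT-5.2 and Gemini 3 Flash each scored 80.0% (32 of 40).
- In complex clinical scenarios, the models tended toward overly aggressive management recommendations.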

## Abstract

Background

Hypertension remains the leading modifiable risk factor for cardiovascular disease and premature death worldwide. In 2024, the European Society of Cardiology (ESC) released updated guidelines for the management of elevated blood pressure and hypertension. Concurrently, the integration of artificial intelligence into healthcare has accelerated, with large language models (LLMs) becoming accessible tools for information retrieval.

Objective

This study aims to evaluate and compare the accuracy and adherence of three popular free-tier LLMs (ChatGPT-5.2, Gemini 3 Flash, and Claude 4.5 Sonnet) in responding to questions based strictly on the 2024 ESC Guidelines.

Methods

We conducted a comparative cross-sectional study in January 2026 to evaluate the performance of three LLMs. The primary source of ground truth was the 2024 ESC Guidelines. A dataset of 40 specific questions was generated, covering key domains including diagnosis, treatment targets, lifestyle modifications, and comorbidities. Questions comprised both factual queries and clinical case reports. A qualified physician categorized each response as correct, inaccurate, or incorrect based strictly on the guidelines. Differences in performance between models were evaluated using the Fisher-Freeman-Halton exact test.

Results

The overall accuracy across all models was high, with no statistically significant differences in performance observed (p>0.99). Claude 4.5 Sonnet achieved the highest numerical accuracy, providing correct responses to 33 out of 40 questions (82.5%). ChatGPT-5.2 and Gemini 3 Flash achieved identical correctness rates of 80.0% (32 out of 40 correct answers). A qualitative analysis revealed a distinct tendency toward overly aggressive management in complex clinical scenarios, suggesting a "safety bias" where models default to intensive intervention rather than nuanced guideline steps.
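The accuracy figures and the exact test can be illustrated with a short sketch built only on the correct-answer counts reported above. It rests on one loud assumption: the abstract does not give the split between "inaccurate" and "incorrect" responses, so the sketch collapses them into a single "not correct" column and runs a Freeman-Halton style exact test on the resulting 3x2 table. This is therefore not the authors' three-category analysis, and its p-value need not match theirs. The function names `accuracy` and `exact_p_value` are hypothetical helpers, not part of the study.

```python
from math import comb

# Correct answers out of 40 questions, as reported in the Results.
QUESTIONS = 40
CORRECT = {"Claude 4.5 Sonnet": 33, "ChatGPT-5.2": 32, "Gemini 3 Flash": 32}


def accuracy(n_correct, n_total=QUESTIONS):
    """Percentage of questions answered correctly."""
    return 100.0 * n_correct / n_total


def exact_p_value(successes, n_per_row):
    """Freeman-Halton style exact p-value for a 3x2 table.

    ASSUMPTION: "inaccurate" and "incorrect" are merged into one
    "not correct" column, because the abstract does not report them
    separately. Every row has the same total, n_per_row.
    """
    if len(successes) != 3:
        raise ValueError("this sketch handles exactly three models")
    total_success = sum(successes)

    def weight(cells):
        # With all margins fixed, a table's probability is proportional
        # to the product of binomial coefficients over its rows.
        w = 1
        for x in cells:
            w *= comb(n_per_row, x)
        return w

    observed = weight(successes)
    as_extreme = 0  # total weight of tables no more likely than the observed one
    everything = 0  # total weight of all tables with the same margins
    for a in range(n_per_row + 1):
        for b in range(n_per_row + 1):
            c = total_success - a - b
            if 0 <= c <= n_per_row:
                w = weight((a, b, c))
                everything += w
                if w <= observed:
                    as_extreme += w
    # Exact integer sums; a single division at the end avoids rounding drift.
    return as_extreme / everything


if __name__ == "__main__":
    for model, n in CORRECT.items():
        print(f"{model}: {accuracy(n):.1f}% correct")
    p = exact_p_value(list(CORRECT.values()), QUESTIONS)
    print(f"exact p-value on the collapsed 3x2 table: {p:.4f}")
```

With 97 correct answers spread over three models, a 33/32/32 split is as even as the data allow, which is why a one-question gap between models gives no evidence of a real difference in performance.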

Conclusions

The evaluated free-tier LLMs demonstrated comparable and high proficiency in interpreting the 2024 ESC Guidelines. Despite this performance, the study identified a recurrent safety bias manifesting as a tendency toward over-medicalization. While these models serve as promising auxiliary tools for medical education, verification of AI-generated advice against official guideline documents remains essential.

## Full-text entities

- **Diseases:** premature death (MESH:D003643), cardiovascular disease (MESH:D002318), Elevated Blood Pressure (MESH:D006973)

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC13015764/full.md

---
Source: https://tomesphere.com/paper/PMC13015764