# Challenges of using generative AI for patient education in chronic heart failure: an evaluation of content quality, readability, and actionability in cross-platform LLM-generated texts

**Authors:** Zhiqiang Wang, Xiaoya Li, Chao Ma, Zhiwen Zhang

PMC · DOI: 10.3389/fpubh.2026.1801829 · 2026-03-05

## TL;DR

This study compares five mainstream Chinese LLM platforms on patient education texts for chronic heart failure self-management. It finds a trade-off between readability and information completeness, with limited actionability overall.

## Contribution

The paper applies a multi-instrument assessment (PEMAT-P, EQIP-36, GQS, and seven readability formulas) to LLM-generated patient education content and identifies platform-specific strengths and weaknesses.

## Key findings

- Doubao and Kimi K2 produced the highest overall quality texts for patient education.
- DeepSeek-R1 provided the most complete information but had the lowest readability.
- ERNIEBot 4.5 Turbo and Qwen3-Max-Thinking-Preview were most readable but less comprehensive.

## Abstract

This study compares the content quality, readability, and actionability of patient education texts for self-management of chronic heart failure (CHF) generated by five mainstream large language models (LLMs) in China, with the aim of informing platform selection and the construction of an assessment framework for clinical use.

A standardized set of 20 questions was developed based on literature review, guidelines, and consensus from cardiovascular experts, covering disease awareness, diagnosis and classification, treatment and rehabilitation, daily management and prevention, and psychosocial dimensions. Using a uniform prompt, responses were generated by DeepSeek-R1, Doubao, ERNIEBot 4.5 Turbo, Qwen3-Max-Thinking-Preview, and Kimi K2. The PEMAT-P scale was used to assess understandability and actionability, the 36-item expanded EQIP scale (EQIP-36) was used to evaluate information completeness and standardization, and the Global Quality Score (GQS) was used to assess overall quality. Additionally, seven readability formulas, including the Flesch Reading Ease Score (FRES) and the Flesch–Kincaid Grade Level (FKGL), were computed for comparison.
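As an illustration of the two readability formulas named above, the following is a minimal sketch of FRES and FKGL using their standard published coefficients. The function names, the tokenization, and the vowel-group syllable counter are assumptions made for this example, not the tooling used in the study. Dedicated readability tools count syllables more carefully, so scores from this sketch will differ somewhat from theirs. Both formulas are defined for English text.

```python
import re


def count_syllables(word):
    """Rough heuristic (an assumption for this sketch): count vowel groups,
    and drop one for a likely silent trailing 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text):
    """Return (FRES, FKGL) for an English text."""
    # Sentences: split on runs of terminal punctuation, ignore empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words: alphabetic tokens, optionally with one internal apostrophe.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    # Flesch Reading Ease Score: higher means easier to read.
    fres = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    # Flesch-Kincaid Grade Level: approximate US school grade.
    fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return fres, fkgl
```

Both scores depend on the same two ratios, words per sentence and syllables per word, with opposite signs. This is why a text with long sentences and polysyllabic clinical terms scores low on FRES and high on FKGL at the same time.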

Overall quality was high [GQS median 5.00 (4.00–5.00)] with significant between-platform differences (χ² = 14.47, P = 0.006). Doubao and Kimi K2 achieved the highest GQS [both 5.00 (5.00–5.00)]. DeepSeek-R1 showed the greatest information completeness [EQIP-36 39.20 (36.17–44.23); χ² = 25.07, P < 0.001] but the lowest readability [FRES 19.32 (17.94–36.89) and FKGL 14.28 (13.02–15.85); both P < 0.001]. ERNIEBot 4.5 Turbo and Qwen3-Max-Thinking-Preview were most readable (FRES ≈ 59; FKGL ≈ 8; both P < 0.001) but had lower EQIP-36 scores. Actionability was limited overall [PEMAT-P actionability 20.00% (0.00–40.00); χ² = 26.40, P < 0.001] and varied by topic, with daily management and prevention outperforming disease knowledge and diagnosis/classification (χ² = 20.86, P < 0.001).
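The abstract does not name the statistical test behind the reported χ² values. Medians with interquartile ranges compared across five groups are commonly analyzed with a Kruskal-Wallis rank test, so the sketch below assumes that test. The function names are hypothetical, and the closed-form p-value helper only covers even degrees of freedom (five groups give 4).

```python
import math
from itertools import chain


def kruskal_wallis_h(groups):
    """Return (H, df) for a list of samples, using average ranks for ties
    and the standard tie correction. Assumed test; not confirmed by the source."""
    pooled = sorted(chain.from_iterable(groups))
    n = len(pooled)

    rank_of = {}   # value -> average rank among the pooled observations
    tie_term = 0   # sum of (t^3 - t) over groups of tied values
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        t = j - i
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        tie_term += t**3 - t
        i = j

    correction = 1 - tie_term / (n**3 - n)
    if correction == 0:
        raise ValueError("all observations are identical; H is undefined")

    rank_sum_term = sum(sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_sum_term - 3 * (n + 1)
    return h / correction, len(groups) - 1


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability, closed form valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2
    return math.exp(-half) * sum(half**i / math.factorial(i) for i in range(df // 2))
```

Under this assumption, `chi2_sf_even_df(14.47, 4)` evaluates to roughly 0.006, which agrees with the P value reported for the GQS comparison. The agreement is consistent with a 4-degree-of-freedom test across the five platforms but does not confirm which test the authors ran.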

LLMs show potential for patient education in CHF, but there is a structural trade-off between information detail and readability, as well as gaps in actionability and verifiability. The authors recommend combining enhanced search with structured template generation, and establishing a governance feedback loop of prompt engineering, clinical expert review, and continuous monitoring to improve readability alignment, completeness of action instructions, and patient safety.

## Full-text entities

- **Diseases:** CHF (MESH:D006333)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12999856/full.md

---
Source: https://tomesphere.com/paper/PMC12999856