# Evaluating the efficacy of large language models in cardio-oncology patient education: a comparative analysis of accuracy, readability, and prompt engineering strategies

**Authors:** Zhao Wang, Lin Liang, Hao Xu, Yuhui Huang, Chen He, Weiran Xu, Haojie Zhu

PMC · DOI: 10.3389/frai.2025.1693446 · 2026-01-13

## TL;DR

This study compares three publicly available large language models (ChatGPT-4, Kimi, and DouBao) on how accurately and readably they answer cardio-oncology patient-education questions, with and without a prompt to simplify language.

## Contribution

The study presents a specialist-rated comparative evaluation of three LLMs in cardio-oncology patient education, including the impact of prompt engineering on response quality.

## Key findings

- 63.3% of LLM responses were rated as correct, with no significant difference in accuracy between models (p = 0.26).
- A prompt to simplify language reduced text complexity but lowered comprehensiveness (p = 0.03) and helpfulness (p < 0.01), especially for DouBao.
- Tailored fine-tuning and specialized evaluation frameworks are needed to optimize LLMs for this domain.

## Abstract

The integration of large language models (LLMs) into cardio-oncology patient education holds promise for addressing the critical gap in accessible, accurate, and patient-friendly information. However, the performance of publicly available LLMs in this specialized domain remains underexplored.

This study evaluates the performance of three LLMs (ChatGPT-4, Kimi, DouBao) acting as assistants to physicians in cardio-oncology patient education and examines the impact of prompt engineering on response quality.

Twenty standardized questions spanning cardio-oncology topics were posed twice to each of the three LLMs: once without an additional prompt and once with a directive to simplify language, generating 240 responses. Four cardio-oncology specialists evaluated these responses for accuracy, comprehensiveness, helpfulness, and practicality. Readability and complexity were assessed using a Chinese text analysis framework.

Among 240 responses, 63.3% were rated “correct,” 35.0% “partially correct,” and 1.7% “incorrect.” No significant differences in accuracy were observed between models (p = 0.26). Kimi produced no incorrect responses. Significant declines in comprehensiveness (p = 0.03) and helpfulness (p < 0.01) occurred after prompting, particularly for DouBao (accuracy: 57.5% vs. 7.5%, p < 0.01). Readability metrics (readability age, difficulty score, total word count, sentence length) showed no inter-model differences, but prompts reduced complexity (e.g., DouBao’s readability age decreased from 12.9 ± 0.8 to 10.1 ± 1.2 years, p < 0.01).
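The rating shares above can be turned back into approximate response counts as a quick sanity check. The following is a minimal sketch, not part of the study's analysis: the helper name `implied_counts` is hypothetical, and the counts are inferred by rounding the reported percentages against the stated total of 240.

```python
# Figures taken from the abstract: 240 responses and three rating shares (percent).
TOTAL_RESPONSES = 240
reported_pct = {"correct": 63.3, "partially correct": 35.0, "incorrect": 1.7}


def implied_counts(total, pct_by_label):
    """Hypothetical helper: round each percentage share of `total`
    to the nearest whole number of responses."""
    return {label: round(total * pct / 100) for label, pct in pct_by_label.items()}


counts = implied_counts(TOTAL_RESPONSES, reported_pct)

for label, n in counts.items():
    # Recompute the share from the rounded count to compare with the abstract.
    print(f"{label}: {n} responses ({100 * n / TOTAL_RESPONSES:.1f}%)")

print("sum of implied counts:", sum(counts.values()))
```

Under this rounding, the implied counts are 152 correct, 84 partially correct, and 4 incorrect, which sum to 240 and reproduce the reported percentages to one decimal place.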

Publicly available LLMs provide largely accurate responses to cardio-oncology questions, yet their utility is constrained by inconsistent comprehensiveness and sensitivity to prompt design. While simplifying language improves readability, it risks compromising clinical relevance. Tailored fine-tuning and specialized evaluation frameworks are essential to optimize LLMs for patient education in cardio-oncology.

## Full-text entities

- **Diseases:** cardio-oncology (MESH:D000072716)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12835249/full.md

---
Source: https://tomesphere.com/paper/PMC12835249