# Assessing the ability of ChatGPT 4.0 in generating check-up reports

**Authors:** Yikai Chen, Yuxin Liu, Yuanchang Huang, Xiujie Huang, Zhuoqun Zheng, Fangjie Yang, Haiming Lin, Haoyu Lin, Xinxin Li, Aosi Xie, Yiteng Huang

PMC · DOI: 10.3389/fmed.2025.1658561 · Frontiers in Medicine · 2025-10-07

## TL;DR

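This study evaluates how well ChatGPT 4.0 can generate health check-up reports. It performed well on guideline adherence, diagnostic accuracy, systematic presentation, and consistency, but was weaker at prioritizing high-risk items and giving comprehensive suggestions.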

## Contribution

The study introduces a systematic evaluation of ChatGPT 4.0's performance in generating health check-up reports using a multi-criteria grading system.

## Key findings

- ChatGPT 4.0 performed well in guideline adherence, diagnostic accuracy, systematic presentation, and consistency.
- It struggled with prioritizing high-risk items and providing comprehensive suggestions.
- English reports showed significant differences in grading based on case complexity.

## Abstract

ChatGPT (Chat Generative Pre-trained Transformer), a generative language model, has been applied across various clinical domains. Health check-ups, a widely adopted method for comprehensively assessing personal health, are now chosen by an increasing number of individuals. This study aimed to evaluate ChatGPT 4.0’s ability to efficiently provide patients with accurate and personalized health reports.

A total of 89 check-up reports generated by ChatGPT 4.0 were assessed. The reports were derived from the Check-up Center of the First Affiliated Hospital of Shantou University Medical College. Each report was translated into English by ChatGPT 4.0 and graded independently by three qualified doctors in both English and Chinese. The grading criteria encompassed six aspects: adherence to current treatment guidelines (Guide), diagnostic accuracy (Diagnosis), logical flow of information (Order), systematic presentation (System), internal consistency (Consistency), and appropriateness of recommendations (Suggestion), each scored on a 4-point scale. The complexity of the cases was categorized into three levels (LOW, MEDIUM, HIGH). The Wilcoxon rank-sum test and the Kruskal-Wallis test were used to examine differences in grading across languages and complexity levels.
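The two tests named above can be sketched in plain Python. The abstract does not state which software or test variants the authors used, so this is only an illustration under stated assumptions: the scores are invented, the function names are hypothetical, the rank-sum test uses a tie-corrected normal approximation, and the Kruskal-Wallis p-value is computed only for the three-group case (two degrees of freedom), which matches the LOW/MEDIUM/HIGH split.

```python
import math
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j hold one tied value
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def tie_term(values):
    """Sum of t^3 - t over each group of t tied values."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return sum(t**3 - t for t in counts.values())


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, tie-corrected)."""
    pooled = list(x) + list(y)
    n1, n2, n = len(x), len(y), len(pooled)
    ranks = average_ranks(pooled)
    w = sum(ranks[:n1])  # rank sum of the first sample
    mean_w = n1 * (n + 1) / 2
    var_w = n1 * n2 / 12 * ((n + 1) - tie_term(pooled) / (n * (n - 1)))
    if var_w == 0:  # every score identical: no evidence of a difference
        return 0.0, 1.0
    z = (w - mean_w) / math.sqrt(var_w)
    return z, math.erfc(abs(z) / math.sqrt(2))


def kruskal_wallis_3(a, b, c):
    """Kruskal-Wallis H test for exactly three groups (tie-corrected)."""
    groups = [list(a), list(b), list(c)]
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        r = sum(ranks[start:start + len(g)])
        total += r * r / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    correction = 1 - tie_term(pooled) / (n**3 - n)
    if correction == 0:  # every score identical
        return 0.0, 1.0
    h = max(h / correction, 0.0)  # clamp tiny negative rounding error
    # With 2 degrees of freedom the chi-square survival function is exp(-h/2).
    return h, math.exp(-h / 2)


# Invented 4-point scores for one criterion; NOT data from the study.
english = [4, 3, 4, 2, 3, 4, 3, 4]
chinese = [3, 3, 4, 2, 4, 4, 2, 3]
low, medium, high = [4, 4, 3, 4], [3, 4, 3, 2], [2, 3, 2, 1]

print("language:", rank_sum_test(english, chinese))
print("complexity:", kruskal_wallis_3(low, medium, high))
```

Both tests work on ranks rather than raw values, which suits ordinal grades on a 4-point scale. Because such grades produce many ties, the sketch applies tie corrections; library routines such as `scipy.stats.ranksums` and `scipy.stats.kruskal` provide these tests as well, though their handling of ties may differ from this sketch.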

ChatGPT 4.0 demonstrated strong performance in adhering to clinical guidelines, providing accurate diagnoses, presenting information systematically, and maintaining consistency. However, it struggled with prioritizing high-risk items and providing comprehensive suggestions. In the “Order” category, a significant proportion of reports contained mixed data, and several reports were completely incorrect. In the “Suggestion” category, most reports were deemed correct but inadequate. No significant advantage for either language was observed, although performance varied across complexity levels. English reports showed significant differences in grading across complexity levels, while Chinese reports exhibited distinct performance across all categories.

In conclusion, ChatGPT 4.0 is currently well-suited as an assistant to the chief examiner, particularly for handling simpler tasks and contributing to specific sections of check-up reports. It holds the potential to enhance medical efficiency, improve the quality of clinical check-up work, and deliver patient-centered services.

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12537671/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12537671/full.md

## References

28 references — full list in the complete paper: https://tomesphere.com/paper/PMC12537671/full.md

---
Source: https://tomesphere.com/paper/PMC12537671