# Evaluating the Accuracy, Usefulness, and Safety of ChatGPT for Caregivers Seeking Information on Congenital Muscular Torticollis

**Authors:** Siyun Kim, Seoyon Yang, Jaewon Kim, Sunyoung Joo, Hoo Young Lee, Hye Jung Park, Jongwook Jeon, You Gyoung Yi

PMC · DOI: 10.3390/healthcare14020140 · 2026-01-06

## TL;DR

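This study evaluates the reproducibility and quality of ChatGPT-5.1 responses to 17 caregiver questions about congenital muscular torticollis (CMT). Ten clinical experts rated the responses moderate to good (mean scores of 3.0 to 3.6 on a 4-point scale), but several answers omitted key cautions or clinical detail.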

## Contribution

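The study contributes a systematic evaluation of ChatGPT for caregiver-centered health information on CMT, combining questions developed through a Delphi process, ratings from ten clinical experts, and two reproducibility metrics (TF–IDF cosine similarity and embedding-based semantic similarity).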

## Key findings

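- Across two independent sessions, ChatGPT showed moderate lexical consistency (mean TF–IDF similarity 0.75) and high semantic stability (mean embedding similarity 0.92).
- Expert ratings on a 4-point Likert scale indicated moderate to good performance (accuracy 3.0, readability 3.6, safety 3.1, overall quality 3.1), but several responses omitted key cautions, oversimplified, or lacked clinical detail.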
- Human oversight is recommended before using LLM outputs in caregiver education.

## Abstract

Background/Objectives: Caregivers of infants with congenital muscular torticollis (CMT) frequently seek information online, although the accuracy, clarity, and safety of web-based content remain variable. As large language models (LLMs) are increasingly used as health information tools, their reliability for caregiver education requires systematic evaluation. This study aimed to assess the reproducibility and quality of ChatGPT-5.1 responses to caregiver-centered questions regarding CMT. Methods: A set of 17 questions was developed through a Delphi process involving clinicians and caregivers to ensure relevance and comprehensiveness. ChatGPT generated responses in two independent sessions. Reproducibility was assessed using TF–IDF cosine similarity and embedding-based semantic similarity. Ten clinical experts evaluated each response for accuracy, readability, safety, and overall quality using a 4-point Likert scale. Results: ChatGPT demonstrated moderate lexical consistency (mean TF–IDF similarity 0.75) and high semantic stability (mean embedding similarity 0.92). Expert ratings indicated moderate to good performance across domains, with mean scores of 3.0 for accuracy, 3.6 for readability, 3.1 for safety, and 3.1 for overall quality. However, several responses exhibited deficiencies, particularly due to omission of key cautions, oversimplification, or insufficient clinical detail. Conclusions: While ChatGPT provides fluent and generally accurate information about CMT, the observed variability across topics underscores the importance of human oversight and content refinement prior to integration into caregiver-facing educational materials.
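The lexical reproducibility metric named in the abstract, TF–IDF cosine similarity between two sessions' responses, can be illustrated with a minimal sketch. The abstract does not specify the tokenizer, the IDF weighting, or the corpus used to fit the weights, so the regex tokenizer, the smoothed IDF formula, and the choice to fit on just the two responses are all assumptions here. The function names and the example sentences are hypothetical and are not taken from the study.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document.

    Assumption: smoothed IDF, ln((1 + n) / (1 + df)) + 1, so that terms
    appearing in every document still carry non-zero weight.
    """
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def session_similarity(response_a, response_b):
    """TF-IDF cosine similarity between two responses to the same question."""
    vec_a, vec_b = tfidf_vectors([response_a, response_b])
    return cosine(vec_a, vec_b)


if __name__ == "__main__":
    # Hypothetical responses for illustration only, not text from the study.
    session_1 = "Gentle stretching is often recommended; ask your clinician first."
    session_2 = "Stretching is commonly recommended, but ask a clinician first."
    print(round(session_similarity(session_1, session_2), 2))
```

Averaging `session_similarity` over all question pairs would give a mean lexical score comparable in kind to the one reported. The embedding-based metric could reuse the same `cosine` function on dense vectors from a sentence-embedding model, which the abstract does not name.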

## Linked entities

- **Diseases:** congenital muscular torticollis (MONDO:0008583)

## Full-text entities

- **Diseases:** CMT (MESH:C535425)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC12840946/full.md

---
Source: https://tomesphere.com/paper/PMC12840946