# Can large language models be trusted? Reliability and readability of responses to perinatal depression FAQs

**Authors:** Jingyu Huang, Hua Yu, Junjian Chen, Xinyue Wang, Lizhi Huang, Junjie Wen, Hui Li

PMC · DOI: 10.3389/fpubh.2026.1760872 · Frontiers in Public Health · 2026-02-23

## TL;DR

This study evaluates the reliability and readability of answers from five LLMs to frequently asked questions about perinatal depression. The answers were generally reliable, but their reading level was above what is recommended for the general public.

## Contribution

The study introduces a systematic evaluation of five LLMs' responses to 27 perinatal depression FAQs, using five validated quality instruments and six readability indices.

## Key findings

- Most LLMs showed moderate to high reliability in answering perinatal depression questions.
- All models wrote above the NIH-recommended sixth-grade reading level, making content difficult for those with lower health literacy.
- Grok4 scored highest on DISCERN, DeepSeek on EQIP, and Copilot on HONCODE, but most models fell short of fully addressing clinical safety standards.

## Abstract

Large language models (LLMs), a core technology of generative artificial intelligence (AI), are increasingly used in health education and promotion. Although they may expand access to medical information, concerns remain regarding the reliability and readability of AI-generated content for the public. This study evaluated the reliability and readability of answers generated by five LLMs to common questions about perinatal depression. The primary aims were to determine (1) the reliability of LLM responses to frequently asked questions about perinatal depression and (2) whether the readability of the generated content aligns with public health literacy levels.

Twenty-seven frequently asked questions were derived from Google Trends and patient-facing resources from the American College of Obstetricians and Gynecologists (ACOG). Each question was submitted to ChatGPT-5, Gemini-2.5, Microsoft Copilot, Grok4, and DeepSeek. Two obstetricians independently rated responses using five validated instruments (DISCERN, EQIP, JAMA, GQS, and HONCODE), and inter-rater agreement was quantified using the intraclass correlation coefficient (ICC). Readability was assessed using six indices: ARI, GFI, CLI, OLWF, LWGLF, and FRF. Differences among models were analyzed using the Friedman test.
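As a rough illustration of the between-model comparison, the following sketch computes the Friedman statistic for a questions-by-models score table. It is not the authors' analysis code: the function names, the toy scores, and the choice to omit the tie correction are all assumptions made for illustration, and the p-value helper only covers even degrees of freedom (which includes the five-model case, with 4 degrees of freedom).

```python
import math


def average_ranks(values):
    """Rank values ascending (1 = smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(scores):
    """Friedman chi-square statistic, without tie correction (an assumption).

    scores: list of rows, one row per question, one score per model in each row.
    """
    n = len(scores)      # number of questions (blocks)
    k = len(scores[0])   # number of models (treatments)
    rank_sums = [0.0] * k
    for row in scores:
        for col, rank in enumerate(average_ranks(row)):
            rank_sums[col] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Hypothetical toy data: 4 questions rated for 5 models (NOT the study's data).
toy_scores = [
    [55, 60, 48, 62, 58],
    [50, 57, 45, 61, 52],
    [53, 59, 47, 66, 56],
    [49, 63, 44, 60, 51],
]
q = friedman_statistic(toy_scores)
p = chi2_sf_even_df(q, df=len(toy_scores[0]) - 1)
print(f"Friedman Q = {q:.3f}, approximate p = {p:.4f}")
```

The test ranks the models within each question, so it compares relative ordering rather than raw scores, which suits ordinal rating instruments. In practice, a statistics package that applies the tie correction would be the safer choice.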

Inter-rater agreement was high across the 27 perinatal depression questions, with ICC values ranging from 0.729 to 0.847. Significant between-model differences emerged for DISCERN, EQIP, and HONCODE (all p < 0.001), whereas no overall differences were found for JAMA and GQS. Grok4 scored highest on DISCERN (60.33 ± 5.48), DeepSeek on EQIP (53.04 ± 4.91), and Copilot on HONCODE (9.26 ± 1.85), highlighting distinct strengths in quality constructs across instruments. Readability was a common limitation: all models exceeded the NIH-recommended sixth-grade level on grade-based indices (for example, ARI ranged from 13.49 ± 2.92 to 15.81 ± 3.25). Similarly, OLWF scores fell well below the sixth-grade benchmark of 94 (ranging from 61.44 ± 6.80 to 72.96 ± 10.39, where higher scores denote easier reading). Most models produced empathetic and informative content but fell short of fully addressing clinical safety standards.
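To make the grade-based readability figures more concrete, here is a minimal sketch of one such index, assuming ARI denotes the Automated Readability Index with its standard formula, 4.71 × (characters / words) + 0.5 × (words / sentences) − 21.43. The function name and the simple regex-based tokenization are assumptions; the study's actual readability tool is not named in this summary and may count characters, words, and sentences differently.

```python
import re


def automated_readability_index(text):
    """Approximate ARI grade level using naive regex tokenization (an assumption)."""
    # Split on runs of sentence-ending punctuation; ignore empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Treat runs of letters, digits, and apostrophes as words.
    words = re.findall(r"[A-Za-z0-9']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    # Count only letters and digits as characters.
    characters = sum(len(re.sub(r"[^A-Za-z0-9]", "", w)) for w in words)
    return (
        4.71 * characters / len(words)
        + 0.5 * len(words) / len(sentences)
        - 21.43
    )


simple = "The cat sat. The dog ran."
dense = "Perinatal depression necessitates comprehensive multidisciplinary evaluation."
print(automated_readability_index(simple))
print(automated_readability_index(dense))
```

Because the index depends only on word length and sentence length, shorter words and shorter sentences lower the score, which is one direct way to bring generated answers closer to a sixth-grade target.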

Most LLMs demonstrated moderate to high reliability when responding to perinatal depression questions, supporting their potential as supplementary sources of health information. However, readability levels above recommended benchmarks suggest that current outputs may remain challenging for individuals with lower health literacy. While LLMs improve information accessibility, further improvements in readability, source attribution, and ethical transparency are needed to maximize public benefit and support equitable health communication. Future work should focus on defining and standardizing safety behaviors in high-risk mental health contexts to enable reliable clinical deployment.

## Linked entities

- **Diseases:** perinatal depression (MONDO:0006663)

## Full-text entities

- **Genes:** F11R (F11 receptor) [NCBI Gene 50848] {aka CD321, JAM, JAM1, JAMA, JCAM, KAT}
- **Diseases:** intimate partner violence (MESH:C563733), mental health problems (MESH:D000076082), mood disorder (MESH:D019964), LLMs (MESH:D007806), Depression (MESH:D003866), postpartum (MESH:D006473), self-harm (MESH:D012652), psychiatric (MESH:D001523), prenatal depression (MESH:D049188), Postpartum Depression (MESH:D019052), anxiety (MESH:D001007), ARI (MESH:C566784), HL (MESH:C538324), impaired emotion regulation (MESH:C565631), perinatal depression (MESH:D066087)
- **Chemicals:** Gemini (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12968175/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12968175/full.md

## References

44 references — full list in the complete paper: https://tomesphere.com/paper/PMC12968175/full.md

---
Source: https://tomesphere.com/paper/PMC12968175