# High-accuracy prediction of mental health scores from English BERT embeddings trained on LLM-generated synthetic self-reports: a synthetic-only method development study

**Authors:** Birger Moëll, Fredrik Sand Aronsson

PMC · DOI: 10.3389/fdgth.2025.1694464 · 2026-01-08

## TL;DR

This study shows that models trained on LLM-generated synthetic mental health self-reports can predict the paired mental health scores with high accuracy on held-out synthetic data, offering a privacy-preserving alternative for method development. It does not test generalization to real patient text.

## Contribution

The novelty lies in demonstrating that models trained only on synthetic data, using mean-pooled BERT embeddings and standard ML models, can predict the scores paired with synthetic narratives with high accuracy.

## Key findings

- PHQ-9 Ridge model achieved an R² of 0.92 and MSE of 4.41 (5-fold cross-validation).
- LSAS Gradient Boosting model achieved an R² of 0.95 and MSE of 75.00 (held-out test split).
- PCL-5 Ridge model achieved an R² of 0.85 and MSE of 35.62 (held-out test split).

## Abstract

This study assesses whether synthetic-only first-person clinical self-reports generated by a large language model (LLM) can support accurate prediction of standardized mental-health scores. The aim is a privacy-preserving path for method development and rapid prototyping when real clinical text is unavailable.

We prompted an LLM (Gemini 2.5; July 2025 snapshot) to produce English-language first-person narratives paired with target scores for three instruments: PHQ-9 (including suicidal ideation, SI), LSAS, and PCL-5. No real patients or clinical notes were used. Narratives and labels were created synthetically and manually screened for coherence and label alignment. Each narrative was embedded using bert-base-uncased (mean-pooled 768-d vectors). For regression, we trained linear and regularized linear models (Linear, Ridge, Lasso) and ensemble models (Random Forest, Gradient Boosting); for suicidal-ideation classification, we trained Logistic Regression and Random Forest. Evaluation used 5-fold cross-validation (PHQ-9/SI) and 80/20 held-out splits (LSAS/PCL-5). Regression metrics were MSE, R², and MAE; classification metrics are reported for SI.
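The embedding and evaluation steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes token-level vectors are already available (in practice they would come from a BERT encoder), uses random arrays as stand-ins for them, and swaps in a closed-form ridge solver written with NumPy. The helper names (`mean_pool`, `ridge_fit`, `kfold_mse_r2`), the `alpha` value, and the placeholder data are all hypothetical.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per text, ignoring padded positions.

    token_embeddings: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1.
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # guard against all-padding rows
    return summed / counts

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def kfold_mse_r2(X, y, k=5, alpha=1.0, seed=0):
    """Return per-fold (MSE, R2) from shuffled k-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(len(y))
    scores = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)  # every index not in the held-out fold
        w, b = ridge_fit(X[train], y[train], alpha)
        pred = X[fold] @ w + b
        mse = float(np.mean((y[fold] - pred) ** 2))
        r2 = 1.0 - mse / float(np.var(y[fold]))
        scores.append((mse, r2))
    return scores

# Placeholder data only: random "token embeddings" and made-up scores,
# NOT the study's narratives or labels.
rng = np.random.default_rng(0)
n_texts, seq_len, dim = 200, 12, 768
tokens = rng.normal(size=(n_texts, seq_len, dim))
mask = np.ones((n_texts, seq_len), dtype=int)
mask[:, 8:] = 0  # pretend the last four positions are padding
X = mean_pool(tokens, mask)  # one 768-d vector per text
y = X @ rng.normal(size=dim) + rng.normal(scale=0.1, size=n_texts)

scores = kfold_mse_r2(X, y, k=5, alpha=1.0)
for i, (mse, r2) in enumerate(scores, start=1):
    print(f"fold {i}: MSE={mse:.3f}  R2={r2:.3f}")
```

Because the inputs are random, the printed scores carry no meaning; the sketch only shows the shape of the pipeline (masked mean pooling to a fixed-size vector, then a regularized linear model scored per fold).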

Within the synthetic distribution, models fit the label–text signal strongly (e.g., PHQ-9 Ridge: MSE 4.41 ± 0.56, R² 0.92 ± 0.02; LSAS Gradient Boosting test: MSE 75.00, R² 0.95; PCL-5 Ridge test: MSE 35.62, R² 0.85).
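To make the reported numbers easier to read, the sketch below shows how MSE, MAE, and R² are conventionally computed, and converts each reported MSE into an RMSE expressed in score points. The function names are hypothetical. The "implied variance" uses the identity R² = 1 − MSE / Var(y), which is exact only when both quantities come from the same evaluation set, so for the cross-validated PHQ-9 figures it is a back-of-envelope approximation, not a value reported in the paper.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MSE, MAE, and R2 for paired true and predicted scores."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - (mse * n) / ss_tot
    return {"mse": mse, "mae": mae, "r2": r2}

def rmse_and_implied_variance(mse, r2):
    """RMSE in score points, plus the label variance implied by R2 = 1 - MSE / Var(y)."""
    return math.sqrt(mse), mse / (1.0 - r2)

# (MSE, R2) pairs as reported in the abstract.
reported = {
    "PHQ-9 Ridge (5-fold CV mean)": (4.41, 0.92),
    "LSAS Gradient Boosting (test)": (75.00, 0.95),
    "PCL-5 Ridge (test)": (35.62, 0.85),
}

for name, (mse, r2) in reported.items():
    rmse, var_y = rmse_and_implied_variance(mse, r2)
    print(f"{name}: RMSE ~ {rmse:.2f} points, implied label variance ~ {var_y:.1f}")
```

Read this way, the reported errors correspond to typical deviations of roughly 2.1 points for PHQ-9, 8.7 for LSAS, and 6.0 for PCL-5, all measured against synthetic labels only.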

LLM-generated self-reports encode a score-aligned signal that standard ML models can learn, indicating utility for privacy-preserving, synthetic-only prototyping. This is not a clinical tool: results do not imply generalization to real patient text. We clarify terminology (synthetic text vs. real text) and provide a roadmap for external validation, bias/fidelity assessment, and scope-limited deployment considerations before any clinical use.

## Full-text entities

- **Diseases:** suicidal ideation (MESH:D001072)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12824020/full.md

---
Source: https://tomesphere.com/paper/PMC12824020