High-accuracy prediction of mental health scores from English BERT embeddings trained on LLM-generated synthetic self-reports: a synthetic-only method development study

Birger Moëll; Fredrik Sand Aronsson

PMC · DOI: 10.3389/fdgth.2025.1694464 · January 8, 2026

Open Access

TL;DR

This study shows that synthetic mental health self-reports generated by an LLM can be used to train models to predict mental health scores with high accuracy, offering a privacy-preserving alternative for method development.

Contribution

The novelty lies in demonstrating that synthetic-only data can yield high-accuracy mental health score predictions using BERT embeddings and standard ML models.

Findings

01

The PHQ-9 Ridge model achieved an R² of 0.92 and an MSE of 4.41.

02

The LSAS Gradient Boosting model achieved an R² of 0.95 and an MSE of 75.00.

03

The PCL-5 Ridge model achieved an R² of 0.85 and an MSE of 35.62.
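The two metrics reported in these findings can be computed as in the minimal sketch below. It is an illustration only, not the authors' evaluation code: the function names and the toy scores are hypothetical. Note that MSE is expressed in squared score units, so values are not directly comparable across instruments with different score ranges (PHQ-9 totals run 0 to 27, PCL-5 totals 0 to 80, LSAS totals 0 to 144), whereas R² is scale-free.

```python
def mse(y_true, y_pred):
    """Mean squared error: average squared gap between target and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical PHQ-9-style targets and predictions (not data from the study).
toy_true = [0, 9, 18, 27]
toy_pred = [1, 8, 19, 26]

toy_mse = mse(toy_true, toy_pred)
toy_r2 = r2(toy_true, toy_pred)
print(f"MSE={toy_mse:.2f}  R2={toy_r2:.3f}")
```

Each toy prediction is off by exactly one point, so the MSE is 1.0 while R² stays close to 1 because the targets span the full score range.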

Abstract

This study assesses whether synthetic-only first-person clinical self-reports generated by a large language model (LLM) can support accurate prediction of standardized mental-health scores, enabling a privacy-preserving path for method development and rapid prototyping when real clinical text is unavailable. We prompted an LLM (Gemini 2.5; July 2025 snapshot) to produce English-language first-person narratives paired with target scores for three instruments: PHQ-9 (including suicidal ideation), LSAS, and PCL-5. No real patients or clinical notes were used. Narratives and labels were created synthetically and manually screened for coherence and label alignment. Each narrative was embedded using bert-base-uncased (mean-pooled 768-d vectors). We trained linear and regularized linear models (Linear, Ridge, Lasso) and ensemble models (Random Forest, Gradient Boosting) for regression, and Logistic…
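The mean-pooling step mentioned in the abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' pipeline: the abstract does not say whether padding positions were masked or whether special tokens were included in the average, so the masking here is an assumption, and the helper name `mean_pool` and the 2-dimensional vectors (standing in for 768-dimensional BERT outputs) are hypothetical.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (assumed: padding is skipped)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy per-token embeddings for one narrative; the last row plays a padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
print(pooled)
```

The padded row is excluded, so the pooled vector is the average of the first two rows only. In a real setup the resulting fixed-length vector per narrative would serve as the feature input to the regression models listed in the abstract.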

Linked entities

Species (1)

Homo sapiens (human · species)

Diseases (1)

suicidal ideation

Taxonomy

Topics: Mental Health via Writing · Digital Mental Health Interventions · Mental Health Research Topics