High-accuracy prediction of mental health scores from English BERT embeddings trained on LLM-generated synthetic self-reports: a synthetic-only method development study
Birger Moëll, Fredrik Sand Aronsson

TL;DR
This study shows that synthetic mental health self-reports generated by an LLM can be used to train models to predict mental health scores with high accuracy, offering a privacy-preserving alternative for method development.
Contribution
The novelty lies in demonstrating that synthetic-only data can yield high-accuracy mental health score predictions using BERT embeddings and standard ML models.
Findings
PHQ-9 Ridge model achieved an R² of 0.92 and MSE of 4.41.
LSAS Gradient Boosting model achieved an R² of 0.95 and MSE of 75.00.
PCL-5 Ridge model achieved an R² of 0.85 and MSE of 35.62.
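For readers less familiar with the two metrics reported above, the following is a minimal sketch of how R² and MSE are conventionally computed for a regression model. The function names and the toy scores are hypothetical and chosen for illustration only; they are not the study's data or code.

```python
from statistics import fmean


def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return fmean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy targets and predictions (not from the paper).
y_true = [2, 9, 14, 21, 26]
y_pred = [3, 8, 15, 20, 25]

print("MSE:", mse(y_true, y_pred))
print("R2 :", r2(y_true, y_pred))
```

Because MSE is expressed in squared score units, its magnitude depends on the range of the instrument, so MSE values are not directly comparable across PHQ-9, LSAS, and PCL-5. R² is scale-free, which makes it the easier number to compare across instruments.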
Abstract
We assessed whether synthetic-only first-person clinical self-reports generated by a large language model (LLM) can support accurate prediction of standardized mental-health scores, which would enable a privacy-preserving path for method development and rapid prototyping when real clinical text is unavailable. We prompted an LLM (Gemini 2.5; July 2025 snapshot) to produce English-language first-person narratives paired with target scores for three instruments: PHQ-9 (including suicidal ideation), LSAS, and PCL-5. No real patients or clinical notes were used. Narratives and labels were created synthetically and manually screened for coherence and label alignment. Each narrative was embedded using bert-base-uncased (mean-pooled 768-d vectors). We trained linear/regularized linear (Linear, Ridge, Lasso) and ensemble models (Random Forest, Gradient Boosting) for regression, and Logistic…
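The mean-pooling step mentioned in the abstract can be illustrated with a small dependency-free sketch. It assumes that padding tokens are excluded via an attention mask, which the abstract does not state; the function name `mean_pool` and the 3-dimensional toy vectors are hypothetical stand-ins for the 768-dimensional token outputs of bert-base-uncased.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding.

    token_vectors: list of equal-length lists of floats, one per token.
    attention_mask: list of 0/1 flags, one per token (1 = real token).
    """
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy example: two real tokens and one padding token (assumed values).
tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
print(pooled)  # one fixed-length vector per narrative
```

In practice this would typically be done on tensors returned by a pretrained model rather than on Python lists. The resulting fixed-length vector for each narrative is what a regressor such as Ridge would take as its input features.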
Taxonomy
Topics: Mental Health via Writing · Digital Mental Health Interventions · Mental Health Research Topics
