# Development and validation of a machine learning model to identify individuals at high risk for psychotic disorders using medical record data

**Authors:** Ben J. Marafino, Andrea H. Kline-Simon, Icelini Stavers-Sosa, David J. Cronkite, Lawrence D. Gerstley, Cimone Durojaiye, Ann Kelley, Linda Kiel, Arvind Ramaprasan, David S. Carrell, Robert B. Penfold, Matthew E. Hirschtritt

PMC · DOI: 10.1186/s12888-026-07846-z · 2026-02-10

## TL;DR

The study developed a machine learning model using electronic health records to identify young people at high risk of developing psychotic disorders, but found challenges with model calibration due to low disorder incidence.

## Contribution

A novel machine learning model using EHR data to identify high-risk individuals for psychosis in routine clinical settings.

## Key findings

- A gradient-boosting model with text features achieved the highest AUC (0.827) for predicting psychosis risk.
- Model performance was consistent across subgroups but suffered from poor calibration due to the low incidence of psychotic-spectrum disorders (PSDs).
- Restricting prediction to higher-risk populations could improve model calibration.

## Abstract

Reducing the duration of untreated psychosis among individuals with early psychosis is associated with improved clinical outcomes and decreased long-term impairment. However, timely identification of individuals at high risk for psychotic disorders in routine clinical practice is challenging, and many individuals are only identified several years following psychotic-symptom onset. This study aimed to leverage comprehensive electronic medical records to develop and validate a machine learning model to identify individuals at high risk of conversion to a psychotic-spectrum disorder (PSD).

This was a cross-sectional, retrospective analysis of electronic health record (EHR) data consisting of clinician free-text documentation and structured data (i.e., age, sex, race/ethnicity, psychiatric diagnoses, encounter modality, and department) among 406,268 Kaiser Permanente Northern California members aged 15–29 years with ≥ 1 primary-care encounter between 2017 and 2019 (~ 1,694,531 encounters). Patients were classified as having a new-onset PSD if they had ≥ 1 PSD diagnosis within 12 months following the index primary-care encounter; all other patients were classified as non-cases. The prediction models were developed using cross-validation with the gradient-boosting and elastic-net algorithms on features extracted from notes, and validated in a random test set.
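The outcome definition above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `label_new_onset_psd` is hypothetical, "12 months" is approximated as 365 days, and counting only diagnoses dated strictly after the index encounter is an assumption.

```python
from datetime import date, timedelta

def label_new_onset_psd(index_date, psd_diagnosis_dates, window_days=365):
    """Return 1 if any PSD diagnosis falls within the follow-up window, else 0.

    Assumptions (not stated in the abstract): the window is 365 days and a
    diagnosis must be dated strictly after the index encounter to count.
    """
    window_end = index_date + timedelta(days=window_days)
    return int(any(index_date < d <= window_end for d in psd_diagnosis_dates))

# Hypothetical patients: one diagnosed within the window, one diagnosed later.
index = date(2018, 1, 10)
print(label_new_onset_psd(index, [date(2018, 6, 1)]))  # case
print(label_new_onset_psd(index, [date(2019, 6, 1)]))  # non-case
```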

A gradient-boosting model that included text features yielded the highest area under the curve (AUC 0.827 [95% CI 0.799 to 0.853]), outperforming an elastic-net model (AUC 0.791 [95% CI 0.760 to 0.821]) and a gradient-boosting model that incorporated only discrete variables (AUC 0.610 [95% CI 0.595 to 0.626]). Model performance was similar across subgroups by sex, age, and race/ethnicity. However, all models exhibited suboptimal calibration, with predicted probabilities systematically underestimating observed PSD risk. Increasing the ratio of PSD cases to non-cases improved discrimination but worsened calibration. Further, the predicted probabilities of developing a PSD were compressed under class imbalance, causing abrupt drops in performance metrics at higher thresholds.
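For readers unfamiliar with how an AUC and its confidence interval can be obtained, the sketch below computes the AUC as the probability that a randomly chosen case receives a higher score than a randomly chosen non-case, and then derives a percentile bootstrap interval. The abstract does not say how the reported intervals were computed, so the bootstrap is an assumption here, and the labels and scores are toy values, not study data.

```python
import random

def auc(labels, scores):
    """AUC as P(score of a case > score of a non-case); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC (resampling patients with replacement)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Toy data for illustration only.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auc(toy_labels, toy_scores))
print(bootstrap_auc_ci(toy_labels, toy_scores, n_boot=200))
```

The pairwise comparison is quadratic in the number of patients, so a rank-based implementation would be preferable at the scale described in this study.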

This study suggests that individuals at elevated risk of developing a PSD may be identified from a general clinical population using a machine-learning model trained on routine clinical documentation and structured EHR data. However, the low incidence of PSDs led to suboptimal calibration. Future studies may restrict prediction to populations with higher PSD incidence, such as mental health clinics, to improve model calibration.
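One simple way to check the kind of miscalibration described here is to compare observed and predicted risk, both overall and within bins of predicted risk. The sketch below is illustrative only: the function names are hypothetical, the binning scheme (equal-sized groups ordered by predicted risk) is an assumption, and the inputs are toy values rather than study data. An observed/expected ratio above 1 corresponds to predictions that underestimate observed risk.

```python
def observed_expected_ratio(labels, probs):
    """Observed event count divided by the sum of predicted probabilities."""
    expected = sum(probs)
    if expected == 0:
        raise ValueError("sum of predicted probabilities is zero")
    return sum(labels) / expected

def calibration_table(labels, probs, n_bins=10):
    """Rows of (mean predicted risk, observed event rate, n) per risk-ordered bin."""
    pairs = sorted(zip(probs, labels))
    n = len(pairs)
    rows = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if chunk:  # bins can be empty when n < n_bins
            mean_pred = sum(p for p, _ in chunk) / len(chunk)
            observed = sum(y for _, y in chunk) / len(chunk)
            rows.append((mean_pred, observed, len(chunk)))
    return rows

# Toy example: every patient is assigned a 10% risk, but 1 in 4 has the event.
toy_labels = [0, 0, 0, 1]
toy_probs = [0.1, 0.1, 0.1, 0.1]
print(observed_expected_ratio(toy_labels, toy_probs))
print(calibration_table(toy_labels, toy_probs, n_bins=2))
```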

The online version contains supplementary material available at 10.1186/s12888-026-07846-z.

## Full-text entities

- **Diseases:** psychotic disorders (MESH:D011618)

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12964962/full.md

---
Source: https://tomesphere.com/paper/PMC12964962