# Risk stratification for long-term inpatient costs in mental disorders: a dual-track machine learning approach using baseline EHRs and hospitalization trajectories

**Authors:** Mengge Zhang, Guoliang Pan, Haohui Shen, Xiuwen He, Jingyi Xiang, Simeng Wang, Mingyang Yao, Yilong Yang

PMC · DOI: 10.1186/s12913-026-14274-y · BMC Health Services Research · 2026-02-28

## TL;DR

This study uses machine learning to predict three-year cumulative inpatient costs for patients with mental disorders. Baseline data from the first admission support prospective prediction, and hospitalization trajectory patterns provide retrospective explanation.

## Contribution

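A dual-track machine learning framework for three-year inpatient costs in mental disorders: one track predicts costs prospectively from baseline data at first admission, and the other uses hospitalization trajectory patterns to explain costs retrospectively.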

## Key findings

- Four distinct hospitalization trajectory patterns were identified, with long-term continuous patterns being most costly.
- Adding trajectory cluster membership to the baseline features raised test R² from 0.35 to 0.71 (ΔR² = 0.36), although trajectories are only known retrospectively.
- SHAP analysis identified payment method, aCCI, diagnosis group, and age as the dominant drivers of hospitalization costs.

## Abstract

Mental disorders (MDs) impose substantial long-term inpatient costs, yet existing prediction models rarely account for dynamic hospitalization trajectories or diagnostic heterogeneity. This study developed and validated a dual-track machine learning framework integrating baseline features with trajectory-derived patterns to predict three-year cumulative hospitalization costs for patients with MDs in China.

We conducted a retrospective cohort study using electronic health records from 3,396 adults with first admission to a psychiatric hospital (2017–2018) and three‑year follow‑up. State sequence analysis and hierarchical clustering identified distinct hospitalization trajectory patterns. Ten baseline variables available at index admission (Set A) and trajectory cluster membership (Set B) were used to train five regression models with stratified 70:30 split and five‑fold cross‑validation. Performance was evaluated using R², RMSE, and MAE on log‑transformed costs. SHAP (SHapley Additive exPlanations) analysis was applied to interpret the optimal model and examine diagnostic heterogeneity.
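The trajectory step described above (state sequence analysis followed by hierarchical clustering) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the sequence encoding, the dissimilarity measure, or the linkage rule. The two-state encoding (`H` for hospitalized, `O` for out of hospital), the Hamming distance, the average linkage, and the four toy sequences are all assumptions made for illustration.

```python
from itertools import combinations


def hamming(a, b):
    """Count positions at which two equal-length state sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def average_linkage(seqs, n_clusters):
    """Naive agglomerative clustering; returns lists of sequence indices."""
    clusters = [[i] for i in range(len(seqs))]
    # Pairwise distances between individual sequences, keyed by (i, j), i < j.
    dist = {
        (i, j): hamming(seqs[i], seqs[j])
        for i, j in combinations(range(len(seqs)), 2)
    }

    def cluster_distance(c1, c2):
        # Average linkage: mean distance over all cross-cluster pairs.
        total = sum(dist[(min(i, j), max(i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Merge the two closest clusters (a < b by construction).
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical 12-period sequences: H = hospitalized, O = out of hospital.
toy_sequences = [
    "HOOOOOOOOOOO",  # single short stay
    "HHOOOOOOOOOO",  # slightly longer short stay
    "HHHHHHHHHHHH",  # continuous stay
    "HHHHHHHHHHHO",  # near-continuous stay
]
toy_clusters = average_linkage(toy_sequences, n_clusters=2)
print(toy_clusters)
```

In this toy input the two short-stay sequences and the two long-stay sequences end up in separate clusters. Real state sequence analysis often uses optimal matching rather than Hamming distance, so the choice here is only a simplification.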

Four distinct trajectory patterns were identified: low‑frequency short‑stay (64.7%), high‑frequency short‑stay (10.0%), long‑term intermittent (4.8%), and long‑term continuous (20.5%). The gradient boosting machine (GBM) achieved the best test performance using Set A (R² = 0.35), significantly outperforming linear regression (R² = 0.33) and random forest (R² = 0.31). Adding trajectory clusters (Set B) increased R² to 0.71 (ΔR² = 0.36), indicating strong association between long‑term hospitalization patterns and cumulative costs, though this component is only retrospectively explanatory. SHAP identified Payment methods, aCCI, Diagnosis groups, and Age as dominant cost drivers. Model performance was stable for the F2 group (61.8% of cohort) but markedly lower for rare diagnostic subgroups (F0, F1).
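For readers who want to see how the reported metrics are defined, the following is a minimal sketch of R², RMSE, and MAE computed on log-transformed costs, plus the ΔR² arithmetic from the figures above. The natural logarithm, the helper name `regression_metrics`, and the three cost values are assumptions for illustration; the abstract does not specify the log base and the numbers are not study data.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², RMSE, and MAE for paired observed and predicted values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }


# Hypothetical three-year cumulative costs, log-transformed before scoring.
toy_costs = [1000.0, 10000.0, 100000.0]
log_costs = [math.log(c) for c in toy_costs]

# A perfect predictor and a predictor that always returns the mean.
perfect = regression_metrics(log_costs, log_costs)
mean_value = sum(log_costs) / len(log_costs)
mean_only = regression_metrics(log_costs, [mean_value] * len(log_costs))

# Difference between the two R² values reported in the abstract.
delta_r2 = round(0.71 - 0.35, 2)
```

A predictor that always returns the mean of the log costs scores R² = 0, which is the reference point for reading the reported 0.35 (Set A) and 0.71 (Set B).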

Risk stratification for three‑year cumulative hospitalization costs is feasible using only routine baseline information from first admission. The proposed dual‑track framework separates prospective prediction from retrospective explanation, providing a methodologically sound tool for institutional resource planning and high‑risk screening in mental health settings. Future work requires external validation and implementation studies.

The online version contains supplementary material available at 10.1186/s12913-026-14274-y.

## Full-text entities

- **Diseases:** mental disorders (MESH:D001523)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12988640/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12988640/full.md

## References

4 references; full list in the complete paper: https://tomesphere.com/paper/PMC12988640/full.md

---
Source: https://tomesphere.com/paper/PMC12988640