# Enhancing personalized suicide risk prediction for VA patients by integrating discrete natural language processing models

**Authors:** Monica Dimambro, Joshua Levy, Jiang Gui, Matan Goldberg, Brian Shiner, Maxwell Levis

PMC · DOI: 10.1038/s41398-026-03940-8 · Translational Psychiatry · 2026-03-20

## TL;DR

This study explores how natural language processing can improve suicide risk prediction for Veterans, especially those at lower risk levels.

## Contribution

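The study evaluates two NLP methods, theory-based closed-vocabulary "semantic" features and data-driven open-vocabulary "count" features, for enhancing suicide risk prediction among VA patients, and compares their predictive performance across high-, moderate-, and low-risk tiers.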

## Key findings

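- Classification models using count or hybrid (count and semantic) variables generally outperformed models using semantic variables alone, as measured by area under the ROC curve.
- Low- and moderate-risk patients saw the highest added benefit, with incremental improvements over leading predictive benchmarks.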
- The approach expands suicide prediction to underserved patient populations.

## Abstract

To improve the identification of Veterans at risk for suicide, the U.S. Department of Veterans Affairs (VA) developed REACH-VET, a suicide risk classification metric. Our previous work demonstrated that incorporating natural language processing (NLP) and developing targeted models for distinct suicide risk tiers could enhance REACH-VET’s predictive accuracy. This study evaluates the benefits of two NLP methods and compares their predictive performance across risk tiers. We created a sample of VA patients who either died by suicide in 2017–2018 (cases) or remained alive during that period (controls), stratified by suicide risk (high, moderate, low). We analyzed unstructured electronic health record (EHR) notes using two NLP models: 1) theory-based, closed-vocabulary “semantic” methods, and 2) data-driven, open-vocabulary “count” methods. We then developed eXtreme Gradient Boosting (XGBoost) classification models using semantic, count, and hybrid (count and semantic) variables and calculated area under the receiver operating characteristic curve to assess predictive accuracy. Generally, classification models using semantic variables performed worse than those using count or hybrid variables. The highest added benefit was seen for low- and moderate-risk patients, for whom the models achieved incremental improvements in performance over and above leading predictive benchmarks. By using different NLP techniques on unstructured EHR data, our approach improves predictive accuracy for lower-risk patients, expanding suicide prediction for patient populations who are poorly understood and frequently underserved.
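The distinction between closed-vocabulary "semantic" variables, open-vocabulary "count" variables, and their hybrid can be illustrated with a minimal sketch. This is not the authors' pipeline: the lexicon, the function names, the `min_docs` threshold, and the example notes are all hypothetical, and the XGBoost classification step is left out. The sketch only shows how the three feature sets might be built from note text and how area under the ROC curve could be computed from model scores using the rank-based definition.

```python
import re
from collections import Counter

# Hypothetical theory-based lexicon (category -> terms); illustrative only,
# not the lexicon used in the paper.
SEMANTIC_LEXICON = {
    "hopelessness": {"hopeless", "worthless"},
    "isolation": {"alone", "isolated"},
}

def tokenize(note):
    """Lowercase a note and split it into word tokens."""
    return re.findall(r"[a-z']+", note.lower())

def semantic_features(note):
    """Closed vocabulary: count tokens falling in each predefined category."""
    tokens = tokenize(note)
    return {
        f"sem_{category}": sum(token in terms for token in tokens)
        for category, terms in SEMANTIC_LEXICON.items()
    }

def build_vocabulary(notes, min_docs=2):
    """Open vocabulary: keep every term found in at least min_docs notes."""
    doc_freq = Counter()
    for note in notes:
        doc_freq.update(set(tokenize(note)))
    return sorted(term for term, n in doc_freq.items() if n >= min_docs)

def count_features(note, vocabulary):
    """Open vocabulary: per-note counts of the corpus-derived terms."""
    counts = Counter(tokenize(note))
    return {f"cnt_{term}": counts[term] for term in vocabulary}

def hybrid_features(note, vocabulary):
    """Hybrid: semantic and count variables side by side."""
    return {**semantic_features(note), **count_features(note, vocabulary)}

def roc_auc(labels, scores):
    """AUC as the probability that a random case outscores a random control.

    Ties contribute 0.5. Labels are 1 for cases and 0 for controls.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))

# Synthetic example notes (not real EHR text).
notes = [
    "Patient feels hopeless and alone.",
    "Patient reports sleeping well.",
    "Patient alone at home, sleeping poorly.",
]
vocabulary = build_vocabulary(notes)
features = [hybrid_features(note, vocabulary) for note in notes]
```

In a fuller workflow, each feature dictionary would become one row of a design matrix passed to a gradient-boosted classifier, with a separate model fitted for each risk tier and `roc_auc` (or a library equivalent) applied to held-out predictions.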

## Full-text entities

- **Diseases:** sexual assault/abuse (MESH:D000082002), suicidal ideation (MESH:D001072), anxiety (MESH:D001007), dementia (MESH:D003704), substance abuse (MESH:D019966), psychiatric (MESH:D001523), Mental (MESH:D008607), death (MESH:D003643)
- **Chemicals:** alcohol (MESH:D000438)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC13039941/full.md

## References

7 references — full list in the complete paper: https://tomesphere.com/paper/PMC13039941/full.md

---
Source: https://tomesphere.com/paper/PMC13039941