Predicting Suicide Among US Veterans Using Natural Language Processing-enriched Social and Behavioral Determinants of Health
Avijit Mitra, Kun Chen, Weisong Liu, Ronald C. Kessler, Hong Yu

TL;DR
This study shows that using natural language processing to extract social and behavioral health data from medical records improves suicide prediction models for US Veterans.
Contribution
The novel contribution is demonstrating how NLP-extracted unstructured data enhances suicide prediction models in veterans' health records.
Findings
Incorporating NLP-extracted SBDH significantly improved predictive model performance across multiple timeframes.
Random forest models showed notable improvements in AUC and precision-recall metrics after adding NLP data.
Enhanced suicide prediction was observed within 180 days of discharge using NLP-enriched data.
Abstract
Despite recognizing the critical association between social and behavioral determinants of health (SBDH) and suicide risk, SBDHs from unstructured electronic health record (EHR) notes for suicide predictive modeling remain underutilized. This study investigates the impact of SBDH, identified from both structured and unstructured data utilizing a natural language processing (NLP) system, on suicide prediction within 7, 30, 90, and 180 days of discharge. Using EHR data of 2,987,006 Veterans between October 1, 2009, and September 30, 2015, from the US Veterans Health Administration (VHA), we designed a case-control study that demonstrates that incorporating structured and NLP-extracted SBDH significantly enhances the performance of three architecturally distinct suicide predictive models - elastic-net logistic regression, random forest (RF), and multilayer perceptron. For example, RF…
Taxonomy
Topics: Suicide and Self-Harm Studies · Health disparities and outcomes · Mental Health Treatment and Access
Introduction
Suicide has consistently ranked among the leading causes of mortality in the US for decades, with a substantial 35.6% increase from 2000 to 2021^1^. In 2021 alone, suicide accounted for 48,183 fatalities in the US^1^, while the global toll surpassed 700,000^2,3^. Existing data indicate a higher suicide rate among Veterans than among non-Veteran adults over the last decade, and, notably, Veterans are experiencing a more pronounced increase in suicide risk^4^. Prior studies found that 80% of suicide victims were in contact with their primary care providers in the year preceding their death, and within the same timeframe, 25.7–31% had sought mental health care^5,6^. This puts healthcare providers in a unique position to contribute, and a better predictive tool may assist them in mitigating the prospective risk of suicidal events.
Social and behavioral determinants of health (SBDH) encompass factors such as socioeconomic status, access to healthy food, education, and housing that wield strong influence over an individual’s health outcomes^7^. Prior studies established strong relationships between SBDHs and suicidal behaviors^8–12^. For example, social disruptions (e.g., relationship dissolution, financial insecurity, legal problems, and exposure to childhood adversity) exhibit significant associations with suicidal behaviors^8,12–15^. However, leveraging SBDHs for predicting suicide has been challenging, primarily because structured data sources, such as ICD codes, are limited in capturing comprehensive and reliable SBDH information. Unstructured clinical notes, enriched with detailed SBDH information, can play a vital role in this regard^12,16^.
The increasing use of Electronic Health Records (EHR) in the US has stimulated efforts to identify patients at suicide risk using EHR data. This has resulted in data mining and machine learning approaches to predict suicidal behavior and suicide mortality among patients in large healthcare systems^17,18^. While most of the existing work on suicide risk assessment using SBDH has focused on structured data sources, unstructured EHR notes represent a relatively untapped data source that can be accessed relatively inexpensively. With the advent of advanced natural language processing (NLP) techniques, there are large opportunities to automate SBDH extraction from EHR notes to augment the structured data, aiding healthcare providers with a more holistic view of a patient’s overall health status and suicide risk^19,20^.
The US Department of Veterans Affairs (VA) operates the largest integrated healthcare network in the country, with a national EHR system used by more than 1,200 medical centers and clinics^21^. With great public concern about the health of Veterans, the VA presents a unique opportunity to fully leverage its data for the exploration of suicide-related predictive modeling. In this study, we conducted the first retrospective case-control study to examine the impact of both structured and NLP-extracted (from unstructured notes) SBDH on suicide death among Veterans. We evaluated three architecturally distinct suicide prediction models across multiple prediction windows. As detailed below, our findings showed that SBDHs can improve all models’ predictive performance across different prediction windows.
Methods
Data Source and Study Design
In this study, we used inpatient and outpatient EHRs from the US Department of Veterans Affairs Veterans Health Administration (VHA) Corporate Data Warehouse. We included all discharges from outpatient emergency room and inpatient care between October 1, 2009 (start of Fiscal Year [FY] 2010) and September 30, 2015 (end of FY 2015). Following Kessler et al.^22^, the unit of analysis was hospital discharge. Our study protocol was approved by the institutional review board of VA Bedford Health Care. The Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD)^23^ reporting guidelines were followed.
Cases were defined as discharges followed by death from suicide (according to the National Death Index^24^, with International Classification of Diseases (ICD), Tenth Revision, codes X60–X84, Y87.0, and/or U03 as the underlying cause of death) in the next D days (the ‘prediction window’). From each discharge, we established a 2-year retrospective ‘observation window’ to aggregate all relevant information for prediction. Each case was randomly matched, without replacement, to 5 discharges that were not followed by suicide in the prediction window (controls), on discharge type and date (± 1 year). Our discharge inclusion criteria were 1) at least one diagnosis or procedure record within the observation window, 2) patients at least 18 years old with no conflicting demographic information, and 3) discharges at least D days before the study end date (September 30, 2015).
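The matching step described above can be sketched as follows. This is a minimal illustration, not the study's code: the record fields (`id`, `type`, `date`), the greedy selection after a random shuffle, and reading "± 1 year" as 365 days are all assumptions made for the example.

```python
import random
from datetime import date, timedelta


def match_controls(cases, candidates, ratio=5, max_days=365, seed=0):
    """Match each case to `ratio` controls, without replacement.

    A candidate is eligible when it has the same discharge type as the
    case and its discharge date lies within `max_days` of the case's.
    (Hypothetical record layout: dicts with "id", "type", "date".)
    """
    rng = random.Random(seed)
    available = list(candidates)
    rng.shuffle(available)  # random order makes the greedy pick random
    matches = {}
    for case in cases:
        picked, remaining = [], []
        for cand in available:
            eligible = (
                len(picked) < ratio
                and cand["type"] == case["type"]
                and abs((cand["date"] - case["date"]).days) <= max_days
            )
            (picked if eligible else remaining).append(cand)
        matches[case["id"]] = picked
        available = remaining  # matched controls are never reused
    return matches


# Toy data: two cases and forty candidate control discharges.
cases = [
    {"id": "c1", "type": "inpatient", "date": date(2012, 3, 1)},
    {"id": "c2", "type": "ED", "date": date(2013, 6, 15)},
]
candidates = [
    {
        "id": f"d{i}",
        "type": "inpatient" if i % 2 == 0 else "ED",
        "date": date(2012, 1, 1) + timedelta(days=30 * i),
    }
    for i in range(40)
]
matched = match_controls(cases, candidates)
```

Because matched controls are dropped from the pool, the order in which cases are processed can affect which controls remain for later cases; a production pipeline would need to decide how to handle cases with fewer than 5 eligible controls.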
We analyzed 4 prediction windows − 7, 30, 90, and 180 days, resulting in 4 cohorts. For each discharge, our task was to predict death by suicide within the prediction window, given all data from the observation window. We put aside the discharges from FY 2015 as the hold-out test data and used remaining discharges for training.
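As a small sketch of the temporal split, the fiscal-year convention stated in the study design (FY 2010 starts on October 1, 2009; FY 2015 ends on September 30, 2015) can be encoded directly. The record layout and function names are assumptions for illustration.

```python
from datetime import date


def fiscal_year(d):
    """US federal fiscal year: FY N runs from October 1 of N-1 to September 30 of N."""
    return d.year + 1 if d.month >= 10 else d.year


def split_by_fiscal_year(discharges, test_fy=2015):
    """Hold out discharges from `test_fy`; train on all earlier fiscal years."""
    train = [r for r in discharges if fiscal_year(r["date"]) < test_fy]
    test = [r for r in discharges if fiscal_year(r["date"]) == test_fy]
    return train, test


# Toy discharges (hypothetical records with only a date field).
discharges = [
    {"date": date(2009, 10, 1)},
    {"date": date(2014, 9, 30)},
    {"date": date(2014, 10, 1)},
    {"date": date(2015, 9, 30)},
]
train, test = split_by_fiscal_year(discharges)
```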
Predictor Construction
We categorized all predictors into four groups: demographics, codes, suicide behavioral (SB) information, and SBDH. The demographic predictors contain patients’ race, gender, age, and marital status. Codes include diagnosis codes, procedure codes, and medication codes. Since diagnosis and procedure codes are hierarchical, encoding all of them may lead to overfitting. We therefore used the single-level Clinical Classification Software (CCS) [1] to categorize them. This led to 283 categories for diagnosis codes and 248 categories for procedure codes. We also categorized medication codes following the VA National Formulary^25^ (VANF) drug classification. More details are available in Appendix 1. SB information includes suicide attempt (SA) and ideation (SI), obtained using the phenotype algorithm available through the VA’s Centralized Interactive Phenomics Resource (CIPHER)[2].
SBDHs were identified from structured data using ICD-9 and VHA stop codes (structured SBDH), and from clinical notes using NLP (NLP-extracted SBDH). Structured SBDHs included 6 factors: social or familial problems, employment or financial problems, housing instability, legal problems, violence, and non-specific psychosocial needs. NLP-extracted SBDHs were obtained from unstructured clinical notes using a transformer-based^26^ NLP system^12^, and comprised 12 factors: social isolation, job or financial insecurity, housing instability, legal problems, barriers to care, violence, transition of care, food insecurity, substance abuse, psychiatric symptoms, pain, and patient disability. SBDHs were extracted from the following 9 note types: emergency department notes, nursing assessments, primary care notes, hospital admission notes, inpatient progress notes, pain management notes, mental health notes, social worker notes, and discharge summaries. To assess the impact of SBDH from both sources, we combined them to create 13 distinct SBDH factors (Appendix 2). In addition to the individual-level SBDHs mentioned above, we also included a neighborhood-level socioeconomic variable, the area deprivation index (ADI)[3], which represents the socioeconomic status of a patient’s neighborhood. ADI includes state and national-level rankings of neighborhoods based on socioeconomic disadvantages. A higher ADI indicates a lower socioeconomic status. We linked each patient’s EHR data to the ADI database via their address zip code and the calendar quarter of discharge to identify the corresponding national-level ranking, which we included as a predictor.
We extracted all predictors from the observation window except diagnosis codes, SB and SBDH (excluding ADI). Diagnosis codes were extracted only from the discharge day as this yielded the best performance in our initial experiments. To capture prior documented SA and SI, we extracted SB data from any time before the current discharge date. Furthermore, we varied the time frame for SBDH to investigate how their proximity to discharge affects subsequent suicide. We chose 7 (a week), 30 (a month), 90 (3 months), 180 (6 months), 365 (1 year) and 730 days (2 years) as candidate time frames. To provide the model a sense of time-variability, we also used SBDH predictors extracted from all six time windows simultaneously.
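To make the time-frame construction concrete, the sketch below builds one binary SBDH indicator per factor and per candidate time frame. It is an illustration under assumptions: the input format (a list of `(factor, note_date)` mentions), the feature naming, and the inclusive boundary are not specified in the text.

```python
from datetime import date, timedelta

TIME_FRAMES = (7, 30, 90, 180, 365, 730)  # days before discharge


def sbdh_indicators(mentions, discharge_date, factors, time_frames=TIME_FRAMES):
    """Return {"<factor>_<days>d": 0 or 1} for every factor and time frame.

    `mentions` is a hypothetical list of (factor, note_date) pairs. A
    factor is marked present for a time frame if it was documented within
    that many days before discharge (boundary inclusive by assumption).
    """
    features = {}
    for days in time_frames:
        for factor in factors:
            present = any(
                f == factor and 0 <= (discharge_date - d).days <= days
                for f, d in mentions
            )
            features[f"{factor}_{days}d"] = int(present)
    return features


discharge = date(2014, 6, 1)
mentions = [
    ("housing instability", discharge - timedelta(days=20)),
    ("social isolation", discharge - timedelta(days=400)),
]
features = sbdh_indicators(
    mentions, discharge, ["housing instability", "social isolation"]
)
```

Passing a single value in `time_frames` corresponds to testing one candidate time frame; passing all six corresponds to the setting where indicators from every window are supplied to the model simultaneously.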
In summary, we considered 619 candidate predictors (Table 1): 4 demographic variables, 283 diagnosis code variables, 248 procedure code variables, 50 medication code variables, 2 SB variables, 6 structured SBDH variables, 12 NLP-extracted SBDH variables, 13 combined SBDH variables, and 1 ADI variable. Demographic and ADI variables were categorical, whereas the remaining predictors were constructed as binary variables indicating absence or presence.
Predictor Screening
Predictor screening was performed on the binary features of diagnosis, procedure, and medication codes. First, we removed any of these predictors with a prevalence of less than 1%. Next, for each remaining predictor, we fit a univariate logistic regression model of suicide death on the predictor and the demographic variables. We evaluated the p-values of the predictors from these univariate models and used the Benjamini-Hochberg procedure^27^ to control the false discovery rate (FDR) at 10%. Only predictors with an adjusted p-value smaller than 0.1 were used as candidate predictors to build the predictive models. Our two-stage screening reduced the number of predictors by 87.4%, 78.51%, 75.77% and 71.41% for the case-control cohorts with 7, 30, 90 and 180-day prediction windows, respectively. Prior works suggest that predictor screening can help with noise reduction and substantially improve out-of-sample model performance^28,29^. We stress that SBDH variables were excluded from the screening stage as the focus of this work is on analyzing their impact on the prediction of suicide.
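The two screening stages can be sketched in plain Python. The p-values are taken as given here (in the study they come from the univariate logistic regressions, which this sketch does not fit); the function names and toy inputs are illustrative assumptions.

```python
def prevalence_filter(columns, min_prevalence=0.01):
    """Stage 1: keep binary predictors present in at least 1% of discharges."""
    return {
        name: values
        for name, values in columns.items()
        if sum(values) / len(values) >= min_prevalence
    }


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def screen(pvalues_by_name, fdr=0.10):
    """Stage 2: keep predictors whose adjusted p-value is below the FDR level."""
    names = list(pvalues_by_name)
    adjusted = bh_adjust([pvalues_by_name[n] for n in names])
    return [n for n, q in zip(names, adjusted) if q < fdr]


# Toy inputs (hypothetical predictor names and p-values).
columns = {"rare_code": [1] + [0] * 199, "common_code": [1] * 5 + [0] * 195}
kept_by_prevalence = prevalence_filter(columns)
adjusted = bh_adjust([0.001, 0.02, 0.04, 0.5])
selected = screen({"a": 0.001, "b": 0.02, "c": 0.04, "d": 0.5})
```

For the four toy p-values, the adjusted values are 0.004, 0.04, about 0.053, and 0.5, so the first three predictors pass the 10% threshold.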
Statistical Analyses
We employed three different machine learning (ML) methods for predictive modeling, namely, elastic-net logistic regression (ENL), random forest (RF), and multilayer perceptron (MLP). For the ENL and RF models, we used 10-fold cross-validation on the training data and performed grid searches over a wide range of hyperparameters to select the best models. For MLP, we used a 2-layer feed-forward network with ReLU^30^ as the activation function. To tune the hyperparameters of MLP, we set apart 20% of the training data as the validation set. As our cohorts had a case-control ratio of 1:5, we used cost-sensitive learning^31^ for all models so that suicide events carried as much weight in training as the more numerous non-suicide events. For ENL and RF, we averaged all metrics over the 10 folds. For MLP, we averaged the model performance over three runs with different seeds. We experimented with different combinations of predictors, as shown in Tables 2 and 3. For SBDH, we experimented with the following combinations: structured SBDH, NLP-extracted SBDH, combined SBDH, structured SBDH + ADI, NLP-extracted SBDH + ADI, and combined SBDH + ADI.
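One common way to realize cost-sensitive learning with a 1:5 case-control ratio is to weight each class inversely to its frequency. The sketch below uses the "balanced" heuristic, weight = n / (number of classes × class count), which is the rule scikit-learn applies for `class_weight="balanced"`. Whether the study used exactly this weighting is an assumption; the text only states that cost-sensitive learning was used. The counts are those of the full 7-day cohort, used purely for illustration.

```python
def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n = len(labels)
    classes = sorted(set(labels))
    return {c: n / (len(classes) * labels.count(c)) for c in classes}


# 7-day cohort: 849 cases (label 1) and 4,245 controls (label 0).
labels = [1] * 849 + [0] * 4245
weights = balanced_class_weights(labels)
```

With these counts, a case receives five times the weight of a control, so both classes contribute the same total weight to the training loss.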
To evaluate the models’ predictive performance, we examined various metrics on the test data, including the area under the receiver operating characteristic curve (ROC AUC), area under the precision-recall curve (PR AUC), sensitivity, specificity, and positive predictive value (PPV). Since suicide is a rare event, we calculated sensitivity, specificity and PPV for different risk group sizes. A risk group size P for a predictive model indicates the fraction of the test set with the highest risk for suicide, as identified by the model. Following prior studies^22,32^ and our data statistics, we included 0.05, 0.10, 0.20 and 0.60 as different values for P. As this is a case-control study, we also reported adjusted PPV^33^. PPV denotes the probability that a patient predicted to be at high risk dies by suicide. Measuring PPV is important because it indicates the chances of saving patients’ lives with interventions.
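The risk-tier metrics can be sketched as below. The tier computation follows the definition of risk group size P given above. The `adjusted_ppv` function is an assumption: it applies Bayes' rule to re-express PPV at a population prevalence, which is one standard correction for case-control sampling, but the exact adjustment used in the study (reference 33) is not spelled out in the text. The prevalence value in the example is illustrative.

```python
def risk_tier_metrics(scores, labels, tier):
    """Sensitivity, specificity and PPV when the top `tier` fraction is flagged."""
    n = len(scores)
    k = max(1, round(tier * n))  # size of the risk group
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    flagged = ranked[:k]
    tp = sum(labels[i] for i in flagged)
    fp = k - tp
    positives = sum(labels)
    negatives = n - positives
    return {
        "sensitivity": tp / positives,
        "specificity": (negatives - fp) / negatives,
        "ppv": tp / k,
    }


def adjusted_ppv(sensitivity, specificity, prevalence):
    """PPV re-expressed at a population prevalence via Bayes' rule (assumed method)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Toy test set: 20 discharges, 4 of them cases, scores already descending.
scores = [i / 20 for i in range(20, 0, -1)]
labels = [0] * 20
for idx in (0, 2, 5, 10):
    labels[idx] = 1

metrics = risk_tier_metrics(scores, labels, tier=0.10)
ppv_at_population_rate = adjusted_ppv(
    metrics["sensitivity"], metrics["specificity"], prevalence=0.0005
)
```

In a 1:5 case-control sample, cases are heavily over-represented, so the raw PPV is far higher than what would be observed in the full population; the adjustment brings it back to the population scale.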
In addition, we conducted calibration analysis and measured predictor importance using the Kernel SHAP (Shapley Additive Explanations) method^34^. For each model, we chose PR AUC to select the best hyperparameter configuration. All analyses used Python 3.8; ENL and RF were implemented using scikit-learn^35^ 0.23.1, and MLP was implemented using PyTorch^36^ 1.5.1.
Results
Prevalence of Suicide
Out of 17,267,304 discharges from 2,987,006 Veterans (Fig. 1), 17,210,996 were eligible to be considered for the 7-day prediction with 849 cases, amounting to 0.005% suicide rate at the discharge level. At the patient level, the suicide rate within 7 days of discharge was 0.03%, with 849 suicide deaths from 2,703,173 patients. Similarly, the suicide rates within 180 days of discharge were 0.05% at the discharge level and 0.27% at the patient level. In summary, the 4 case-control cohorts for prediction windows 7, 30, 90 and 180 days consisted of 5,094 (849 cases and 4,245 controls), 14,256 (2,376 cases and 11,880 controls), 29,580 (4,930 cases and 24,650 controls) and 46,668 discharges (7,778 cases and 38,890 controls) respectively. More details are available in Appendix 3.
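The reported rates and cohort sizes follow from the counts above; a short check, using only figures stated in this section:

```python
def percent(numerator, denominator):
    """Express a ratio as a percentage."""
    return 100 * numerator / denominator


# 7-day window: 849 suicide deaths.
discharge_rate_7d = percent(849, 17_210_996)  # discharge-level rate
patient_rate_7d = percent(849, 2_703_173)     # patient-level rate

# Each case is matched to 5 controls, so a cohort holds 6 discharges per case.
cases_by_window = {7: 849, 30: 2376, 90: 4930, 180: 7778}
cohort_sizes = {window: cases * 6 for window, cases in cases_by_window.items()}
```

Rounded, the two rates are 0.005% and 0.03%, and the cohort sizes are 5,094, 14,256, 29,580 and 46,668, matching the text.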
Overall Model Performance
The results are shown in Tables 2 and 3. With ‘SBDH’ as predictors, we only reported results for the combinations that yielded the best PR AUC scores. We noticed incremental improvements across almost all models and prediction windows as we added a new predictor group. Adding codes and SB information always improved the AUC scores (Table 2). A similar trend can also be observed with SBDHs. However, the best SBDH setting for PR AUC did not always yield the best ROC AUC score.
ENL achieved the best AUC scores for the 7 and 30-day prediction windows, except MLP attaining the best PR AUC in the 7-day prediction window. In contrast, RF achieved the best AUCs across 90 and 180-day prediction windows. In general, models for the shortest prediction window (7 days) had the lowest ROC AUCs (74.44%–77.65%), and as prediction windows got longer, the models performed better with the highest ROC AUCs (77.39%–83.94%) obtained for the longest prediction window (180 days). PR AUC scores demonstrated a similar trend. AUC scores were almost always higher among outpatient ED discharges than inpatient discharges.
Across all prediction windows with the best predictor configuration, these models detected 12.98–24.58% of all deaths from suicide at the 5% risk tier (Table 3). This means that even considering only 5% of the discharges with the highest model-assigned suicide risk, a suicide intervention program based on these models can capture 12.98%–24.58% of patient discharges where the patients would otherwise die by suicide. Increasing the risk group size can help capture even more discharges, for example, 24.97%–41.14% at a 10% risk group size. PPVs and adjusted PPVs increase as the prediction window increases and the risk group size decreases. We obtained the highest adjusted PPV of 1.07% for the RF model over the 180-day prediction window at the 5% risk tier. This suggests that in the top 5% risk tier, patients from 1.07% of discharges would die by suicide within 180 days of their hospital discharges in the absence of any additional intervention program.
Impact of NLP-extracted Predictors
In this study, we used an NLP system to extract SBDH from clinical notes. We compared our NLP-extracted SBDHs with structured and combined SBDHs. eTable 5 lists all the SBDH combinations that yielded the best performance for each model at a specific prediction window. In half of the settings (6 out of 12), NLP-extracted SBDHs appeared as the best choice whereas structured SBDH performed better in four settings. We also found ADI to be helpful in most settings.
Calibration and Predictor Importance
Of the three models, RF was better calibrated than the others (eFigure 1–2). However, there was no noticeable difference between a model with and without SBDH (eFigure 2). We also measured predictor importance using the Kernel SHAP method (eFigure 4). Based on SHAP values, we identified predictors that pushed a model towards making positive predictions (suicide death) and predictors that did the opposite. We named them positive and negative predictors, respectively. Upon examining the top 30 positive predictors, we found that SA, SI, and the age group 79 or higher were the most common predictors across different models and prediction windows. In contrast, black race, female gender, and age 50–59 were the most consistent negative predictors in the top 30. Among diagnosis predictors, ‘Administrative/social admission’, ‘COPD’, ‘alcohol-related disorders’, and ‘anxiety disorders’ were the most common positive predictors. As for procedure categories, ‘anesthesia’ was a common positive predictor, whereas ‘cardiac stress tests’ was a common negative predictor. Among medications, ‘sedative hypnotics’ was a prominent positive predictor and ‘antidepressants’ was a common negative predictor. Among SBDHs, ‘social isolation’ (NLP-extracted) and ‘violence’ (structured) were two of the most common positive predictors. We would like to emphasize that SHAP values do not indicate risk or protective factors; rather, they help rank predictors according to their usefulness for a task (suicide prediction) with respect to a model (ENL, RF, or MLP).
Ensemble Learning
Ensembling is a popular technique for aggregating multiple models’ predictions to improve system robustness. Among various aggregator functions such as linear averaging, majority voting, boosting, etc., we chose linear averaging for our study. First, for each model, we averaged the prediction probabilities over all folds/runs, and then we averaged them over the different models. We did this for the two best models (ENL and RF) and for all three models. The results are shown in Table 4. We found that ensembling ENL and RF improved the AUC scores over the best single models for the 7-, 30-, and 90-day prediction windows. However, the performance did not improve for the 180-day prediction window. By comparison, ensembling all three models was only helpful for the 7-day prediction window. Overall, the RF model is still better calibrated than the ensembled systems (eFigure 3).
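The two-step linear averaging can be sketched as follows. The nested-list input layout and the toy probabilities are assumptions for illustration.

```python
def mean(values):
    return sum(values) / len(values)


def ensemble_probabilities(model_runs):
    """Average over folds/runs within each model, then across models.

    `model_runs` maps a model name to a list of runs; each run is a list
    of predicted probabilities, one per test discharge (hypothetical layout).
    """
    per_model = {
        name: [mean(column) for column in zip(*runs)]
        for name, runs in model_runs.items()
    }
    return [mean(column) for column in zip(*per_model.values())]


# Toy predictions for two test discharges.
model_runs = {
    "ENL": [[0.2, 0.8], [0.4, 0.6]],  # two folds
    "RF": [[0.1, 0.9]],               # one fold, for brevity
}
ensembled = ensemble_probabilities(model_runs)
```

Averaging within each model first gives every model equal influence on the final probability, regardless of how many folds or runs it contributed.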
Discussion
To the authors’ knowledge, this is the first case-control study to examine the roles of NLP-extracted SBDHs in predicting suicide among US Veterans. We found that models with SBDH predictors outperformed models without SBDH. For example, the RF model with no SBDH achieved an ROC AUC of 83.57% and a PR AUC of 57.38% in the 180-day prediction window. After adding ADI and NLP-extracted SBDH (timeframe = 730 days), the AUCs increased to 84.25% (0.81% improvement, 95% CI = 0.63–0.98, p-val < 0.001) and 59.87% (4.34% improvement, 95% CI = 3.86–4.82, p-val < 0.001) respectively. Moreover, when compared with structured SBDH, NLP-extracted SBDHs yielded competitive or better performance in most situations.
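The improvement percentages quoted above appear to be relative rather than absolute changes in AUC. A quick check under that reading, using only the figures in the paragraph:

```python
def relative_improvement(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100 * (after - before) / before


roc_gain = relative_improvement(83.57, 84.25)  # ROC AUC, 180-day RF model
pr_gain = relative_improvement(57.38, 59.87)   # PR AUC, 180-day RF model
```

Rounded to two decimals, these come to 0.81% and 4.34%, which agrees with the reported values; the absolute gains are 0.68 and 2.49 percentage points.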
SBDH improved the performance for all cases, with ROC AUC improvements of up to 3.86% and PR AUC improvements of up to 11.21%. This is consistent with prior studies^22,37^ where multiple SBDH factors were identified as important predictors for suicide after discharge from VA psychiatric hospitalization. However, those studies lacked a robust deep-learning-based system for extracting SBDH from clinical notes. Our results also showed that all models benefitted from including NLP-extracted SBDHs, either alone or in combination with other SBDHs. This highlights the merit of harnessing clinical notes through NLP to enrich SBDH information for improved predictive modeling.
Our work showed that near-term prediction of suicide death is more challenging than longer-term prediction: all models performed best with the 180-day prediction window, and performance declined as the prediction window shortened. This may partly stem from the lack of adequate samples in shorter prediction windows, making it more challenging for any model to map the predictors to suicide. Other studies suggested that the larger number of suicides over longer windows increases predictive models’ statistical power^22,37^. They found that models built to predict suicide over longer windows outperform models built to predict over shorter windows when applied at those shorter windows.
We also ranked the predictors using their SHAP values (eFigure 4). We discovered that records of prior SA and SI are two of the most important predictors for death by suicide across all prediction windows. SA is well-established as a significant risk factor for suicide^3,38^. Data indicate that one out of every 100 attempt survivors dies from suicide within the first year, a risk approximately 100 times higher than that observed in the general population^39^. Furthermore, the risk of suicide can persist up to 32 years following an attempt^40^. A systematic review of 90 studies found a 6.7% suicide completion rate and a 23% non-fatal attempt rate^41^. We also found ‘social isolation’ and ‘violence’ to be two of the most common positive SBDH predictors. Prior studies showed a strong association between social isolation and suicide risk^12,42–45^. Exposure to violence is also a well-known risk factor for suicidality^12,46,47^.
Using NLP to extract clinically relevant information from EHR notes is not new. Datta et al. reviewed 78 studies that utilized NLP to extract cancer-related information^48^. Mitra et al. developed a deep-learning-based NLP system to extract social determinants of health from EHR notes and showed their significant associations with suicide among US Veterans^12^. Bhanu et al. designed an NLP system to extract SB information from EHR notes^49^. Many other works also used NLP systems to detect suicidality in EHR notes^50–53^. However, ours is the first case-control study to incorporate NLP-extracted SBDHs as predictors for suicide death prediction.
Although predicting suicidal behavior has been an active area of research^17,22,28,54,55^, our study differs in the addition of NLP-extracted SBDH as predictors to analyze their impact on a diverse set of models’ predictive performance. Despite many existing studies on the prediction of suicide, integrating their findings into existing healthcare systems poses a multitude of challenges, such as lack of logistics support at the deployment centers, risk-benefit tradeoff, cost-effectiveness, a sense of false reassurance^22^, and generalizability, among others. Moreover, a systematic review of 17 suicide prediction studies found that all predictive models suffer from low PPV, regardless of the population distribution or risk tier^56^, making suicide prediction a challenging task. In contrast, Kessler et al. showed that predictive models have positive net benefit across plausible ranges of the PPV distribution^37^.
Limitations and Future Work
Our study has several limitations. Firstly, the VA population’s demographic composition differs from that of the overall US population. Nonetheless, research utilizing VHA data has informed non-VA facilities in implementing enhanced clinical practices^57–59^. Additionally, our study employed no VA-exclusive predictors, allowing for the extraction of the same predictors from EHRs at non-VA facilities for customized prediction models. Secondly, our analysis focused solely on outpatient emergency and inpatient care discharges. Expanding to include other hospital settings could enhance our comprehension of SBDHs’ impact on suicide. We leave this for our future work. Thirdly, we restricted the observation window to 2 years to incorporate relatively current SBDHs, but extending it to encompass historical SBDHs may enhance model predictions, a subject we will explore in future research. Lastly, we utilized the ADI, available only at the census tract block group level; however, we plan to investigate the recently proposed social vulnerability metric^60^ as an alternative in future studies.
Conclusions
Ours is the first large-scale study to use NLP-extracted SBDH information from unstructured EHR data to predict suicide among Veterans. We showed that incorporating NLP-extracted SBDH improved predictive performance across different models and prediction windows. Consequently, integrating NLP-extracted SBDH with structured EHR data is a promising avenue for advancing more effective suicide prevention systems.
References
- 1. Suicide Data and Statistics | Suicide Prevention | CDC. https://www.cdc.gov/suicide/suicide-data-statistics.html.
- 2. Wang H. Global, regional, and national life expectancy, all-cause mortality, and cause-specific mortality for 249 causes of death, 1980–2015: a systematic analysis for the Global Burden of Disease Study 2015. Lancet 388, 1459–1544 (2016). PMID: 27733281. doi:10.1016/S0140-6736(16)31012-1. PMCID: PMC5388903.
- 3. Suicide. https://www.who.int/news-room/fact-sheets/detail/suicide.
- 4. National Veteran Suicide Prevention Annual Report. Office of Mental Health and Suicide Prevention (2021).
- 5. Walby F. A., Myhre M. Ø. & Kildahl A. T. Contact With Mental Health Services Prior to Suicide: A Systematic Review and Meta-Analysis. Psychiatr Serv 69, 751–759 (2018). PMID: 29656710. doi:10.1176/appi.ps.201700475.
- 6. Stene-Larsen K. & Reneflot A. Contact with primary and mental health care prior to suicide: A systematic review of the literature from 2000 to 2017. Scand J Public Health 47, 9–17 (2019). PMID: 29207932. doi:10.1177/1403494817746274.
- 7. Healthy People.gov. Social Determinants of Health | Healthy People 2020. Healthy People 2020 Topics and Objectives 5–8. Preprint at https://www.healthypeople.gov/2020/topics-objectives/topic/social-determinants-of-health (2014).
- 8. Blosnich J. R. Social Determinants and Military Veterans’ Suicide Ideation and Attempt: a Cross-sectional Analysis of Electronic Health Record Data. J Gen Intern Med 35, 1759–1767 (2020). PMID: 31745856. doi:10.1007/s11606-019-05447-z. PMCID: PMC7280399.
