# Clinical Information Extraction From Notes of Veterans With Lymphoid Malignancies: Natural Language Processing Study

**Authors:** Lu He, Matthew R Moldenhauer, Kai Zheng, Helen Ma

PMC · DOI: 10.2196/63908 · JMIR Medical Informatics · 2025-10-16

## TL;DR

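This study develops and validates a rule-based clinical natural language processing pipeline that extracts clinical information from the notes of veterans with lymphoid malignancies. It finds that error rates for primary diagnosis and substance use differ between non-Hispanic White and non-Hispanic Black veterans.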

## Contribution

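A rule-based clinical natural language processing (cNLP) pipeline is developed and validated for lymphoid malignancies, a rare-disease setting with little annotated data. Comparing its performance across two racial cohorts reveals disparities in error rates for specific entities.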

## Key findings

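- The pipeline performed well on entities with standardized clinical documentation, such as performance status.
- For primary diagnosis, both the false-positive and false-negative rates were significantly associated with race (P=.001 for both); for substance use, only the false-negative rate was (P=.02).
- Performance was comparable across the two cohorts for most entities; the exceptions were primary diagnosis and substance use.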

## Abstract

Clinical natural language processing (cNLP) techniques are commonly developed and used to extract information from clinical notes to facilitate clinical decision-making and research. However, they are less established for rare diseases such as lymphoid malignancies due to the lack of annotated data as well as the heterogeneity and complexity of how clinical information is documented. In addition, there is increasing evidence that cNLP techniques may be prone to biases embedded in clinical documentation or model development. These biases can result in disparities in performance when extracting clinical information or predicting patient outcomes.

This study aims to report the development and validation of a cNLP pipeline that extracts clinical information such as performance status, staging, and diagnosis, as well as less common information such as substance use and military environmental exposures, from the clinical notes of veterans with lymphoid malignancies.

We developed a rule-based cNLP pipeline that integrates domain expertise. We tested and compared the performance of the cNLP pipeline on notes from 2 veteran patient cohorts: one from non-Hispanic White veterans and the other from non-Hispanic Black veterans.
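To make "rule-based" concrete, here is a minimal Python sketch of how one entity, performance status, might be pulled from a note with a hand-written pattern. The paper's actual rules are not shown in this summary, so everything here is an assumption for illustration: the choice of the ECOG scale (grades 0 to 5), the regular expression, and the names `ECOG_PATTERN` and `extract_ecog` are all hypothetical.

```python
import re

# Hypothetical rule (not the authors' rule set): the word "ECOG", optionally
# followed by "performance status" or "PS", an optional "score"/"grade",
# an optional connector (is / of / = / :), and then a single grade 0-5.
ECOG_PATTERN = re.compile(
    r"\bECOG(?:\s+(?:performance\s+status|PS))?"
    r"\s*(?:score|grade)?\s*(?:is|of|=|:)?\s*([0-5])\b",
    re.IGNORECASE,
)


def extract_ecog(note: str) -> list[int]:
    """Return every ECOG grade (0-5) mentioned in the note, in order."""
    return [int(match.group(1)) for match in ECOG_PATTERN.finditer(note)]


if __name__ == "__main__":
    # Invented example text, not taken from any real clinical note.
    print(extract_ecog("ECOG PS: 1. Patient ambulatory."))
```

Entities with standardized wording suit this style of rule because a small number of patterns covers most phrasings. Free-text entities such as substance use have far more varied wording, which is one plausible reason a rule set would do worse on them.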

Overall, our pipeline achieved promising performance on our study data, especially for extracting entities that have standard clinical documentation, such as performance status. We also found that while the pipeline has robust performance across the two patient groups, the false-positive and false-negative rates were significantly associated with race for detecting the primary diagnosis (P=.001 for both); the false-negative rate was significantly associated with race for identifying substance use (P=.02).
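The per-cohort error analysis described above can be sketched in Python as follows. This summary does not say which statistical test the authors used, so the Pearson chi-square test on a 2x2 table is an assumption, as are the function names `error_rates` and `chi_square_2x2`. The counts in the usage example are invented for illustration and are not results from the paper.

```python
import math


def error_rates(tp: int, fp: int, tn: int, fn: int) -> tuple[float, float]:
    """Return (false-positive rate, false-negative rate) from confusion counts."""
    fpr = fp / (fp + tn)  # share of true negatives wrongly flagged
    fnr = fn / (fn + tp)  # share of true positives that were missed
    return fpr, fnr


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-square test (no continuity correction) for [[a, b], [c, d]].

    Rows are the two cohorts; columns are error / no error.
    Returns (statistic, p-value) with 1 degree of freedom.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    stat = n * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


if __name__ == "__main__":
    # Invented counts: missed vs. detected true positives in two cohorts.
    stat, p = chi_square_2x2(20, 80, 40, 60)
    print(f"chi2={stat:.3f}, p={p:.4f}")
```

With small cell counts, an exact test such as Fisher's would be the more conservative choice; the chi-square version is used here only because it needs nothing beyond the standard library.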

The system exhibits satisfactory and comparable performance for most clinical entities of interest except for (1) the primary diagnosis and (2) substance use. Future work will address the challenges encountered in developing and deploying the cNLP pipeline on Department of Veterans Affairs data for rare cancers and enhance the performance of cNLP systems to avoid biases.

## Full-text entities

- **Diseases:** Lymphoid Malignancies (MESH:D008223), cancers (MESH:D009369), substance use (MESH:D019966)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12530692/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC12530692/full.md

## References

28 references — full list in the complete paper: https://tomesphere.com/paper/PMC12530692/full.md

---
Source: https://tomesphere.com/paper/PMC12530692