# Unsupervised discovery of clinical disease signatures using probabilistic independence

**Authors:** Thomas A. Lasko, William W. Stead, John M. Still, Thomas Z. Li, Michael Kammer, Marco Barbero-Mota, Eric V. Strobl, Bennett A. Landman, Fabien Maldonado

PMC · DOI: 10.1016/j.jbi.2025.104837 · Journal of biomedical informatics · 2026-01-05

## TL;DR

This study uses probabilistic independence to uncover hidden disease sources and their effects in electronic health records, helping identify causes of benign and malignant lung nodules.

## Contribution

A novel method for unsupervised discovery of clinical disease signatures using probabilistic independence in EHR data.

## Key findings

- The model recovered 92% of malignant and 30% of benign causes in the reference standard.
- Top inferred causes included novel findings with supporting evidence in the literature.
- Predictive models trained on the inferred sources matched their associational baselines in accuracy (Random Forest AUC 0.788 vs. 0.738, p = 0.058) while exposing more detailed disease sources.

## Abstract

This study uses probabilistic independence to disentangle patient-specific sources of disease and their signatures in Electronic Health Record (EHR) data.

We model a disease source as an unobserved root node in the causal graph of observed EHR variables (laboratory test results, medication exposures, billing codes, and demographics), and a signature as the set of downstream effects that a given source has on those observed variables. We used probabilistic independence to infer 2000 sources and their signatures from 9195 variables in 630,000 cross-sectional training instances sampled at random times from 269,099 longitudinal patient records. We evaluated the learned sources by using them to infer and explain the causes of benign vs. malignant pulmonary nodules in 13,252 records, comparing the inferred causes to an external reference list and other medical literature. We compared models trained by three different algorithms and used corresponding models trained directly from the observed variables as baselines.
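As a rough illustration of the source/signature idea, the sketch below assumes a linear independent component analysis (ICA) as one way to exploit probabilistic independence; the abstract does not state which estimator the authors used, so this is a stand-in and not their method. Each row of `S_hat` plays the role of one instance's source expressions, and each column of `A_hat` plays the role of a signature (the effect of one source on the observed variables). The function `fast_ica`, the synthetic data, and all dimensions are hypothetical and far smaller than the 2000 sources and 9195 variables reported above.

```python
import numpy as np

def fast_ica(X, n_components, n_iter=200, tol=1e-6, seed=0):
    """Symmetric FastICA with a tanh nonlinearity (illustrative only).

    X: (n_instances, n_variables) observed data.
    Returns S (n_instances, k) estimated source expressions and
    A (n_variables, k) estimated signatures, with X - mean ~= S @ A.T.
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)

    # Whiten: project onto the top-k principal directions and rescale
    # so the projected data has identity covariance.
    evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    K = evecs[:, -n_components:] / np.sqrt(evals[-n_components:])
    Z = Xc @ K

    # Start from a random orthogonal unmixing matrix (rows = unmixing vectors).
    W = np.linalg.qr(rng.standard_normal((n_components, n_components)))[0]
    for _ in range(n_iter):
        G = np.tanh(Z @ W.T)
        # Fixed-point update: E[z g(w.z)] - E[g'(w.z)] w, for every row w.
        W_new = (G.T @ Z) / len(Z) - (1.0 - G**2).mean(axis=0)[:, None] * W
        # Symmetric decorrelation keeps the rows orthonormal.
        U, _, Vt = np.linalg.svd(W_new)
        W_new = U @ Vt
        delta = np.max(np.abs(np.abs(np.sum(W_new * W, axis=1)) - 1.0))
        W = W_new
        if delta < tol:
            break

    S = Z @ W.T                   # source expressions per instance
    A = np.linalg.pinv(W @ K.T)   # signatures: one column per source
    return S, A

# Synthetic stand-in for EHR data: 3 hidden sources drive 5 observed variables.
rng = np.random.default_rng(42)
n_instances, n_sources, n_vars = 5000, 3, 5
S_true = rng.laplace(size=(n_instances, n_sources))    # non-Gaussian sources
A_true = rng.standard_normal((n_vars, n_sources))      # true signatures
X = S_true @ A_true.T + 0.01 * rng.standard_normal((n_instances, n_vars))

S_hat, A_hat = fast_ica(X, n_components=n_sources)

# Sources are recovered only up to order, sign, and scale, so compare by
# absolute correlation with the best-matching estimated component.
corr = np.abs(np.corrcoef(S_true.T, S_hat.T)[:n_sources, n_sources:])
match = corr.max(axis=1)
print(match)
```

Real EHR variables are sparse, asynchronous, and often categorical, so a linear, noise-light model like this one only conveys the intuition that statistically independent latent sources can be separated from their mixed downstream effects.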

The model recovered 92% of malignant and 30% of benign causes in the reference standard. Of the top 20 inferred causes of malignancy, 14 were not listed in the reference standard but had supporting evidence in the literature, as did 11 of the top 20 inferred causes of benign nodules. The model decomposed listed malignant causes by an average factor of 5.5 and benign causes by 4.1, with most stratifying by disease course or treatment regimen. Predictive accuracy of causal predictive models trained on source expressions (Random Forest AUC 0.788) was similar to that of their associational baselines (AUC 0.738; p = 0.058).
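For readers unfamiliar with the AUC figures quoted above, the following minimal sketch computes AUC as the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. The helper `auc` and the toy labels and scores are hypothetical; the abstract does not describe how the authors computed AUC or the reported p-value, and neither is reproduced here.

```python
import numpy as np

def auc(labels, scores):
    """Area under the ROC curve via pairwise ranking (ties count as 1/2)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))

# Toy example: 1 = malignant, 0 = benign (made-up scores, not study data).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc(toy_labels, toy_scores)
print(toy_auc)  # 0.75: three of the four positive/negative pairs are ranked correctly
```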

Most unrecovered causes were missed because the condition was rare or the input data lacked sufficient detail. Surprisingly, the causal model identified apparently undiagnosed cancer as the source of the malignant nodules in many patients. The causal model's AUC also suggests that some sources remained undiscovered in this cohort.

These promising results demonstrate the potential of using probabilistic independence to disentangle complex clinical signatures from noisy, asynchronous, and incomplete EHR data that represent the confluence of multiple simultaneous conditions, and to identify patient-specific causes that support precise treatment decisions.

## Linked entities

- **Diseases:** cancer (MONDO:0004992)

## Full-text entities

- **Diseases:** cancer (MESH:D009369), benign nodules (MESH:D016606), pulmonary nodules (MESH:D055613)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12767692/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12767692/full.md

## References

98 references — full list in the complete paper: https://tomesphere.com/paper/PMC12767692/full.md

---
Source: https://tomesphere.com/paper/PMC12767692