Unsupervised discovery of clinical disease signatures using probabilistic independence

Thomas A. Lasko; William W. Stead; John M. Still; Thomas Z. Li; Michael Kammer; Marco Barbero-Mota; Eric V. Strobl; Bennett A. Landman; Fabien Maldonado

PMC · DOI:10.1016/j.jbi.2025.104837·January 5, 2026

Unsupervised discovery of clinical disease signatures using probabilistic independence

Thomas A. Lasko, William W. Stead, John M. Still, Thomas Z. Li, Michael Kammer, Marco Barbero-Mota, Eric V. Strobl, Bennett A. Landman, Fabien Maldonado

PDF

Open Access

TL;DR

This study uses probabilistic independence to uncover hidden disease sources and their effects in electronic health records, helping identify causes of benign and malignant lung nodules.

Contribution

A novel method for unsupervised discovery of clinical disease signatures using probabilistic independence in EHR data.

Findings

01

The model recovered 92% of malignant and 30% of benign causes in the reference standard.

02

Top inferred causes included novel findings with supporting evidence in the literature.

03

Causal models showed similar predictive accuracy to associational baselines despite uncovering more detailed disease sources.

Abstract

This study uses probabilistic independence to disentangle patient-specific sources of disease and their signatures in Electronic Health Record (EHR) data. We model a disease source as an unobserved root node in the causal graph of observed EHR variables (laboratory test results, medication exposures, billing codes, and demographics), and a signature as the set of downstream effects that a given source has on those observed variables. We used probabilistic independence to infer 2000 sources and their signatures from 9195 variables in 630,000 cross-sectional training instances sampled at random times from 269,099 longitudinal patient records. We evaluated the learned sources by using them to infer and explain the causes of benign vs. malignant pulmonary nodules in 13,252 records, comparing the inferred causes to an external reference list and other medical literature. We compared models…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species1

Homo sapiens(human · species)

Diseases3

cancer benign nodules pulmonary nodules

Figures4

Click any figure to enlarge with its caption.

c](#F1)). Specifically, $[eqn]$ may be a standard probabilistic statistical model such as a Random Forest or a deep architecture and still produce causal predictions of $[eqn]$ given $[eqn]$ . The model $[eqn]$ does not explicitly receive the values of any $[eqn]$ , and we do not assume that it internally reconstructs the causal relationships among the $[eqn]$ , although a sufficiently powerful architecture may infer some version of them in constructing the effect of $[eqn]$ on $[eqn]$ . This analysis depends on certain assumptions, some of which are common to causal inference, and others enab

Keywords

Phenotype discoveryMachine learningElectronic health recordsCausal inferenceIndeterminate pulmonary noduleLung cancer

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMachine Learning in Healthcare · Biomedical Text Mining and Ontologies · Artificial Intelligence in Healthcare

Full text

Background and significance

A clinical disease label is an abstraction over a collection of unobserved pathophysiologic mechanisms that produce an overlapping set of clinical manifestations. The label can be an effective way to reduce the complexity of clinical practice, but the lack of precision can lead to treatment failure when the patient’s specific disease mechanism does not respond to the typical treatment [1–9].

Identifying the mechanistic pathways underlying a given disease label is an important but labor-intensive undertaking involving biochemical or clinical experiments. However, we may be able to get partway there by identifying the set of changes (or signatures) that each unobserved mechanism has on the observable variables in the Electronic Health Record (EHR). Our premise is that similar but distinct mechanisms leave similar but distinct signatures in the record.

Our goal is to computationally identify these signatures, and from them draw some inference about their unobserved mechanisms. We want to understand causes of disease at the individual patient level, and to perform the inference task at institutional scale in the number of variables, training instances, and learned causes. (We will use the term cause in its technical sense of an ancestor node in a causal graph. From a clinical perspective, a cause can be any factor that if manually changed would affect the downstream probability of a disease. It may include a direct cause, a risk factor, a protective factor, or a preventative measure. A cause is distinguished from an association or a correlate, which may have a common upstream cause with the disease in question, or may be a downstream effect of the disease, but if it were manually changed would not actually affect the disease.).

We require an unsupervised approach because our motivating observation is that the existing clinical labels are insufficiently precise. Instead, we want to allow the data to speak directly, modeling the unobserved disease mechanisms as unobserved root nodes (or latent disease sources) in the causal graph of observed EHR variables, and identifying a signature for each latent source. We have previously named this task phenotype discovery[10], and success requires overcoming both inherent and practical difficulties.

The major inherent difficulty arises from the fact that patients simultaneously experience multiple conditions that together employ a shared vocabulary of manifestations, creating a record that is a confluence of patterns that must be disentangled.

The major practical difficulty arises from the fact that medical records are neither clean nor complete[11]: data are observed sparsely, irregularly, and asynchronously, but not at random; billing codes and problem lists don’t always reflect the true or current patient state [12,13]; medication lists incompletely capture whether medications are prescribed, appropriately dosed, filled, and taken as prescribed[14,15]; and clinical measurements are subject to error in analytic, pre-analytic, and post-analytic processes[16].

Previous approaches

1.1.

We are unaware of any prior work that achieves our goal of inferring patient-specific causes of disease from institutional-scale EHR data (much of it infers predictive instead of causal features), but we review major steps that have been taken in this direction.

A common approach to phenotype discovery has been to partition a set of patients into disjoint clusters and then identify the salient clinical pattern in each cluster[17–25], but it turns out that this hard clustering approach does not address the inherent difficulty of the problem. Hard clustering is appropriate when each record legitimately belongs to exactly one cluster, such as partitioning dwellings into geographic regions. But it fails for phenotype discovery because it identifies only one pattern per patient, allowing strong patterns to dominate subtle patterns. In medical record data, strong patterns tend to reflect demographics, common conditions unrelated to the disease of interest, or even cohort inclusion criteria.

A better approach is to adapt methods of representation learning[26], which structures raw elements of a record into multiple meaningful or explanatory features. Representation learning was originally designed to replace the practice of engaging experts to create predictive features manually. Phenotype discovery falls into the subproblem of causal representation learning[27], which imposes the additional constraint that the explanatory patterns should represent elements of the data’s cause-effect network. Disentangling a patient record into a causal representation would not only identify the precise causes of the patient’s diseases, but also allow us to answer interventional questions, such as ‘What would be the effect on the patient’s future health outcomes if we intervened to remove this cause?’ If treatment response data can be included in the model, this is the type of question that can lead to precision treatment decisions.

The work that began the popular wave of deep learning used a multi-layer autoencoder for disentangling[28]. The disentangling was guided by minimizing reconstruction error under a constrained number of explanatory factors. The main contribution of the work was to demonstrate learning nonlinear components one layer at a time that were intended to be a useful decomposition of the data space (with the meaning of useful determined by the constraints imposed on each hidden layer), but with no attention to causal properties.

The earliest phenotype discovery from EHR data that we know of used a similar autoencoder guided by sparsity, L_2_ regularization, and uncertainty-normalized reconstruction error to find explanatory features in irregular longitudinal observations of a single laboratory test result[10]. Other autoencoder variants use robustness to noise[29], factor decorrelation, hierarchical structure, and many other guiding principles[26,30]. Variational autoencoders are considered by some to be the state of the art for disentanglement, although using them to recover true latent sources remains an open problem, in part because of the difficulty of designing a loss function that achieves probabilistic independence of the learned explanatory factors[31], or one that can otherwise guide the learning toward causal rather than predictive features.

The question of how to judge a learned representation (and therefore how to guide the learning) is also an open one, and is more complex than judging the accuracy of a classification or a regression[26]. Early work, including applications in medicine, nevertheless used predictive accuracy on one or more supervised problems as a measure of representation quality[10,29,32,33–38]. Predictive accuracy is a reasonable baseline measure, particularly if the supervised application used for evaluation did not directly guide the feature learning. But the most predictive features may not necessarily be those with the greatest explanatory power, and may not actually represent the pathophysiologic pathways.

Using independence to guide disentangling

1.2.

The degree of probabilistic independence between inferred sources can be a useful signal to guide disentangling[39,40], and has the additional benefit that it can produce sources with our desired causal properties[27,41]. In phenotype discovery, the intuition is that the true sources of different diseases act independently of each other, so their observable patterns in a patient record should also occur independently. Of course, some diseases are caused by others, such as myocardial infarction caused in part by atherosclerosis, and these relationships form a potentially dense causal network. However, we consider every disease to have an additional unobserved (or latent) cause of its own that is independent of all other causes, and our task is to find these mutually independent latent causes. These latent causes form the set of root nodes of that network because none of them have observed parents.

More formally, a causal graph corresponding to a set of observed variables $[eqn]$ can be estimated from a large set of observations made under different conditions or at different times in different patients (Fig. 1a). In such a graph, we consider the value of each observed variable $[eqn]$ to be a function $[eqn]$ of its observed parents $[eqn]$ , and an unobserved parent $[eqn]$ representing a latent source.

The $[eqn]$ are often called error terms, but their function is much more important than simply adding noise to the observations. In effect, they are the locations in the graph where all outside information is injected before being distributed among the observed variables. We can think of the graph as existing in a steady state until a latent source changes its value, which then causes downstream changes in some observed variables[41]. For source $[eqn]$ , the set of downstream changes follows a specific pattern or *signature $[eqn]$ * that depends on the strength of causal relationships in the network (Fig. 1 a, b). The $[eqn]$ are considered root causes of the changes in the graph because they are the roots, or parent-less nodes, of the causal graph. In phenotype discovery from EHR data, each $[eqn]$ is in fact the output of an unobserved and potentially large upstream network of biochemical and environmental processes. The true pathophysiologic cause of disease is contained somewhere in that upstream network, but the latent $[eqn]$ is as close as we can get to that cause given the observed variables in the data.

The mutual independence of the $[eqn]$ values (or source expressions) is used to guide the disentangling. In general, if the functions $[eqn]$ and the distributions of the latent sources $[eqn]$ can be arbitrarily chosen, it is impossible to recover them from observations alone, but a solution can be found by imposing additional assumptions[27,31]. If we assume linear functions and non-Gaussian distributions for the source expressions, with a sufficiently comprehensive set of variables $[eqn]$ and minimal confounding (which we achieve by using large-scale EHR data), methods such as ICA and Linear Non-Gaussian Acyclic Models (LiNGAM)[40,42] can recover both the sources $[eqn]$ and the signatures $[eqn]$ by the linear decomposition $[eqn]$ . The causal graph itself can then be recovered from $[eqn]$ [43], or Direct LiNGAM can recover it directly from $[eqn]$ .

Relaxing the requirement of linear $[eqn]$ can be done if additional assumptions are made, which is an active research direction [27,31,44–52,53].

Empirically, source independence has been demonstrated to be an objectively more accurate principle than other disentangling approaches. On a task in the genomic domain, objectively comparing 42 different methods of inferring groups of genes that are functionally related and coregulated, investigators found that independence-based methods outperformed all others tested[54]. A different group found that transcriptome structure inferred using ICA was conserved across five datasets, despite those datasets coming from different research groups using different technologies[55].

Identifying causal sources

1.2.1.

The fact that the latent sources $[eqn]$ are mutually independent root nodes implies that any learned model $[eqn]$ will be a causal model of $[eqn]$ given $[eqn]$ , regardless of the architecture of $[eqn]$ (Fig. 1c). Specifically, $[eqn]$ may be a standard probabilistic statistical model such as a Random Forest or a deep architecture and still produce causal predictions of $[eqn]$ given $[eqn]$ . The model $[eqn]$ does not explicitly receive the values of any $[eqn]$ , and we do not assume that it internally reconstructs the causal relationships among the $[eqn]$ , although a sufficiently powerful architecture may infer some version of them in constructing the effect of $[eqn]$ on $[eqn]$ . This analysis depends on certain assumptions, some of which are common to causal inference, and others enable the ICA decomposition. Supplemental Section A1.6.1 details the assumptions and the formal causal definitions. Also see previous work[41] for the full rigorous development of the causal nature of $[eqn]$ .

If $[eqn]$ is a disease label for a given patient, the elements of $[eqn]$ that meaningfully affect the prediction are the estimated latent causes for that patient’s disease. If those causes have identifiable treatments, then those treatments would be the optimal and precise treatments for the patient.

In contrast, the elements of $[eqn]$ are mutually dependent and represent entangled effects, so the common approach of building a predictive statistical model $[eqn]$ (Fig. 1d) would not identify causal sources without attention to recovering the causal graph among the $[eqn]$ as part of $[eqn]$ . Specifically, we would not expect a Random Forest, a deep architecture, or any other standard machine learning approach to produce a causal model under the design $[eqn]$ .

Identifying patient-specific latent sources is the first step in the direction of identifying patient-specific treatments. Understanding what those latent sources represent pathophysiologically and what treatments may affect them are important future steps.

Contributions

1.3.

In this work, we present 1) the use of probabilistic independence to disentangle data signatures of latent sources of disease from routinely collected, episodic, and noisy EHR data at institutional scale, 2) an evaluation of this approach by estimating patient-specific causes of indeterminate pulmonary nodules found either incidentally or by screening, and 3) discovery of potential patient-specific signals of existing but undiagnosed cancer as causes of the nodules.

Materials and methods

The inference pipeline included data collection, transformation, and signature discovery components (Fig. 2). We summarize these methods here, with more detailed descriptions in the supplemental material Section A1.

This research was approved as non-human-subjects research by the Vanderbilt University Institutional Review Board (#210761).

Data collection

2.1.

Vanderbilt University Medical Center is a tertiary care center with an extensive EHR covering about 3 million patients, with nearly all inpatient and outpatient records complete after 2005. We used two datasets drawn from our EHR: a Discovery Set, comprising 269,099 records of patients with a broad range of lung disease, used to learn the latent clinical signatures; and an Evaluation Set comprising 13,252 records of patients with an indeterminate pulmonary nodule and no prior history of any cancer, used to train and test the predictive model. The Evaluation Set is a subset of the Discovery Set.

We extracted n = 9,195 variables from all records, including clinical measurements, billing codes, medication mentions, and demographics. Records in the evaluation set were labeled positive for patients who received a billing code for any lung malignancy over the 3 years after the nodule appearance, and negative for patients who did not (Table 1). A set of 2651 records (20%) from the Evaluation Set were randomly partitioned into the final test set.

For the Evaluation Set, data from each record were collected from the time of the first pulmonary nodule code and earlier. No information except the label was included from after the nodule date. Labels were positive if the record included a code for malignant lung neoplasm on day 4 – 1095 following the pulmonary nodule date. This interval begins on day 4 to exclude patients diagnosed with cancer on the same day as the nodule discovery, allowing for variability in coding timing of the two types of codes. The intent is to identify the causes of benign vs. malignant nodules in patients for whom that was unknown at the time of nodule discovery.

Labels were validated in a separate study[56] by comparison with cancer registry records (all positive labeled records) and manual chart review (all mismatches with the cancer registry and a random sample of negative labeled records), and estimated to have 0.98 PPV, 0.99 NPV, 0.93 sensitivity, and 0.996 specificity (supplemental Section A1.1.2).

Data transformation

2.2.

To overcome some of the practical difficulties arising from the messiness of EHR data (Section 1), our data were converted from point observations to probabilistic continuous longitudinal curves with a specific transformation for each data mode that preserves its essential characteristics but abstracts away the problems of sparse, irregular, and asynchronous observations (supplemental Section A1.2). Clinical measurements were fit with a smooth interpolation that maintains non-stationarity, or a constant population median if a patient had no observations of a given test. Billing codes were represented as a longitudinal intensity curve of code occurrence events per unit time, minimizing the effects of errors by treating the events as a probabilistic signal instead of ground truth. Medication mentions were transformed to piecewise-constant 0/1 curves representing the presence or absence of the medication in a record at a given date. Race and sex were represented as constant 0/1 curves, binarized from categorical variables when appropriate.

Next, curves for each patient were time-aligned and stacked, with cross sections sampled uniformly at random at a mean density of 1 sample per 3 record-years. An individual record may be randomly sampled once, more than once, or not at all, with longer records tending to be sampled more times. A total of $[eqn]$ sampled cross sections were stacked into a final data matrix $[eqn]$ . Each column was standardized to place them all onto roughly the same scale.

Clinical signature discovery

2.3.

The matrix $[eqn]$ of stacked cross sections was decomposed by FastICA [57] into $[eqn]$ , where $[eqn]$ , $[eqn]$ , $[eqn]$ , and $[eqn]$ (Fig. 2). The rows $[eqn]$ are (approximately) mutually independent by construction, the column $[eqn]$ represents the learned clinical signature of latent source $[eqn]$ , and matrix element $[eqn]$ represents the level at which cross section $[eqn]$ expresses source $[eqn]$ . The choice of the number of latent sources $[eqn]$ was limited by computational complexity and available resources; we suspect the optimal $[eqn]$ is much higher (Fig. A1 in supplemental methods).

The scale of each $[eqn]$ is non-identifiable under ICA, so expressions of each source are arbitrarily scaled individually to zero mean and 0.5 standard deviation in the Discovery Set (Fig. 3, insets), so a record may express a source either positively or negatively. Expressing a source negatively means the record manifests the opposite effect from what appears in the visualized signature.

Signatures in $[eqn]$ were produced with only a sequential identifier, and no clinical interpretation or descriptive name. Where a name was needed for reporting purposes, we provided one using clinical expertise to interpret the pattern.

Evaluation

2.4.

Causal models learned from observational data are notoriously difficult to evaluate for correctness[58], because they model interventions that have not been observed – in our case, how a patient record would change if we removed a given source. Likewise, evaluating an unsupervised disentangling is currently an unsolved problem[31,59] – in our case, the clinical sources are either unlabeled or an existing label may group them with (perhaps many) related sources.

Typically, causal inference approaches are evaluated using synthetic data in which all causal effects are known by design. ICA has been amply validated this way[39,41,57,60,61], but synthetic validation only goes so far, because synthetic data is typically simpler and cleaner than real data. Synthetic validation doesn’t, for example, assess whether the processes observed in a particular observational dataset violate the assumptions of the analysis, nor how far from truth the results of that analysis land under those violations.

In the current work, objective evaluation of the causal model was inspired by the technique of quantitative probing[62], in which verifiable elements are compared with externally known facts, with the difference that the original work evaluated edges internal to the causal graph, and we assess edges between root nodes and the predicted target.

Clinical Background and Rationale

2.4.1.

To implement the evaluation, we trained supervised models to predict whether an indeterminate pulmonary nodule appearing in a patient record would in the following 3 years be labeled malignant or benign, and to identify the latent sources for that labelled outcome. Indeterminate pulmonary nodules are abundant in clinical care, with over 1.5 million identified each year in the US alone[63]. There are many different potential causes of lung nodules, with the vast majority representing benign disease of little consequence for patients, but a small minority being extremely consequential as they represent early stage lung cancer (Table 2). Lung cancer remains one of the most common cancers, as well as the deadliest, accounting for over 120,000 deaths every year in the US – more than colon, breast, and prostate cancers combined. The best chance for cure is early detection and surgical resection, which makes accurate identification and characterization of malignant nodules a critically important healthcare project. Current management of lung nodules is imperfect, as human detection is error-prone due to lack of expertise, CT interpretation fatigue, and communication breakdowns. In addition, radiologist estimation of the probability of malignancy is often based on morphological features in the absence of clinical context, ignoring important patient-specific clinical variables. In addition, current methods of diagnosis are costly and carry morbidity[64], and even identifying the approximately 5% of nodules that are malignant is a clinical challenge[65].

Source completeness and decomposition

2.4.2.

We collected a reference list of the causes of malignant and benign lung nodules presented by a respected literature synthesis source[66] and identified the match(es) for each listed cause among our inferred latent sources. The first step in characterizing the inferred sources was to assess them for completeness and decomposition compared to this reference list. Ideally, this would be done for all possible sources for all conditions, but that would be intractable. Instead, we chose what from the discovery model’s perspective is an arbitrary subset of sources relating to indeterminate pulmonary nodules, unknown to the model during the discovery step, and assessed that subset.

To assess completeness, we identified the fraction of conditions on the reference list for which we recovered at least one match among the inferred signatures.

To assess decomposition, we computed the average number of recovered sources that matched a listed cause, when at least one such match was found.

Causal source accuracy

2.4.3.

We next assessed the degree to which a supervised model predicting malignant/benign labels $[eqn]$ from source expressions $[eqn]$ (Section 1.2.1) identified (on average) accurate causes for the nodules by comparing the top 20 inferred causes to the reference list.

To extract inferred causes from the model, we used sample-specific Shapley additive explanation (SHAP) values[67,68]. We then aggregate these causes over all instances and compare the top causes to the reference list.

We used Random Forest, XGBoost, and Elastic Net architectures for the supervised models. One causal model $[eqn]$ of each type was trained to predict the binary malignancy label $[eqn]$ using as input the source expressions $[eqn]$ , and a corresponding associative statistical model $[eqn]$ was trained using the original variable values $[eqn]$ . Hyperparameters were optimized by cross validation independently for each of the six configurations. With the optimized hyperparameters fixed, training for each configuration was repeated with 100 different random seeds and evaluated on the common held-out test set to determine the extent of variation due to random choices made during training.

For each matched latent source, its patient-specific causal effect on the malignancy label was computed using the global mean positive sample-specific SHAP value (omitting negative values), which was visualized for comparison with the reference list. For causes of benign nodules, we computed the global mean negative SHAP value (omitting positive values). Polarity-specific means were used because for causes of malignant nodules, a positive SHAP value for a given patient indicates the source was generating disease, and a negative value indicates it was protecting against disease. Causes of benign nodules have the opposite polarity.

A SHAP value is a mathematical representation of the patient-specific causal effect of a source, taking into account (by marginalizing over) combinations with all other sources[41]. Because we allow $[eqn]$ to include nonlinear and high-order effects, identical values of $[eqn]$ in two different records may have different causal effects, depending on other sources expressed in the record.

The relative performance between causal and statistical models could legitimately lean either way by a small amount. A causal model includes only causal predictors, and the omitted non-causal predictive associations may advantage the statistical model. Additionally, information is lost when reducing from 9195 to 2000 input variables, which is done by PCA as the first step of ICA. On the other hand, the causal model learns its signatures from the much larger Discovery Set, which allows it to borrow statistical strength. Our comparison is to ensure that one is not dramatically different from the other, which would suggest a pipeline problem; whichever way the balance leans in terms of predictive accuracy, the main advantage of the causal model is in illuminating the potentially treatable sources of disease for each patient.

New candidate identification

2.4.4.

We next considered any of the top 20 causes that were not on the reference list to be potential new candidate causes. We assessed them by evaluating literature evidence from PubMed searches that they may be causes of malignant or benign nodules, or lung cancer in general. The sources are cited and reasoning for each case are noted in Tables 2 and 3.

If literature indicating that a source has a mechanistic relationship in the correct direction to lung nodules or lung cancer, it is listed as strong evidence. For example, source 1936/COPD (Table 3) named: COPD is a known independent risk factor for lung cancer due to its effect on the cellular environment, in addition to the fact that both COPD and lung cancer have a common cause of smoking[69,70]. It is the size of independent causal effects like this that the analysis is designed to estimate, even in the presence of common upstream causes.

If the signature of an inferred cause contains elements that affect cancer risk, but the signature itself doesn’t appear to be the main cause, that was assessed as moderate evidence. For example, source 361/Treated Unstable Angina (Table 4): while treating unstable angina isn’t known to directly affect lung cancer risk, newly diagnosed cancer is associated with anatomical severity of coronary artery disease in patients with high levels of inflammation,[71] and the signature contains medications such as aspirin and simvastatin that possess known anti-inflammatory and anticancer properties.[72–74].

If there is some possible but unclear connection between some signature elements and the label, then that was labeled weak evidence. For example, source 227/Pregnancy Elevated Glucose Complications (Table 3): the signature is of women with pregnancy complications and abnormal glucose tolerance, with a small probability of diabetes mellitus. Diabetes and cancer are associated[75], but this signature is more specific than that, and no other diabetes signatures ranked highly as causes. Moreover, lung cancer is not high on the list of cancers associated with diabetes,[75] and the specific relationship of Gestational Diabetes to any cancer has been only sparsely studied, with inconsistent results.

If no literature was found linking the inferred source to pulmonary nodules or cancer, that was labeled absent evidence. Sources with no evidence may be model errors or they may be true but as-yet unrecognized causes. For example, source 7/Postpartum Pregnancy No Complications (Tables 3 and 4): the signature is a general postpartum pattern with a small race differential, and positive expression increases predicted malignancy risk. Lung cancer in pregnancy is rare[76], but we can find no estimates of lung cancer probability given a pulmonary nodule in or soon after pregnancy.

Results

Learned clinical signatures

3.1.

The 2000 learned signatures in $[eqn]$ (three relevant examples in Fig. 3, all 2000 in supplemental material) cover the full range of disease found in the Discovery Set, and most of them are unrelated to lung nodules. The signatures are sparse, most indicating near-zero changes for the vast majority of the 9195 variables. Our signature plots sort the variables by the size of the change and present the top $[eqn]$ variables, where $[eqn]$ is chosen such that 97.5% of the information in the original vector (as judged by the $[eqn]$ norm) is present in those $[eqn]$ variables.

As with the signatures, source expressions are also sparse (Fig. 3, insets).

Looking in detail at some relevant examples (Fig. 3), we interpret signatures 1940 and 1198 to represent primary upper lobe and lower lobe lung cancer, respectively. They include well-established clinical manifestations[77], such as dyspnea, pneumothorax, localized swelling consistent with superior vena cava syndrome, and risk factors such as tobacco use.

We interpret signature 1452 to represent sources of primary malignancy in various locations of the genitourinary system, but predominantly bladder cancer. It includes appropriate differentials for sex and age, and known manifestations such as hydronephrosis, hematuria, and recurrent urinary tract infections.[78,79] It may represent multiple causes combined into a single source because the Discovery Set focused on lung disease and contained too few examples of genitourinary disease to resolve them.

Completeness

3.2.

For malignant etiologies, matching signatures were found for 12 of 13 (92%) listed elements, with the caveat that the learned signatures for primary lung cancer were decomposed by anatomic location rather than cell type (Table 2). For benign etiologies, matching signatures were found for 7 of 23 (30%) listed elements.

Decomposition

3.3.

Malignant etiologies with matching signatures had an average of 5.5 signatures per element (Table 2), with the maximum decomposition for Breast Cancer, which was decomposed by treatment approach and clinical course. Benign etiologies had an average of 4.1 signatures per element, with the maximum decomposition for Rheumatoid Arthritis, which was mainly decomposed by treatment approach.

Causal source accuracy and new candidate cause identification

3.4.

Of the top 20 inferred sources of malignant nodules, 6 (30%) were directly listed as causes (Table 3). Of those not listed, 7 (35%) had at least moderate evidence in other literature, 3 (15%) had weak or mixed evidence, and 4 (20%) had no prior evidence found, although one of these (1622) turned out to be a proxy for the length of the patient record (details in supplemental information).

Of the top 20 inferred sources of benign nodules, 6 (30%) were directly listed as causes (Table 4). Of those not listed, 7 (35%) had at least moderate evidence, 4 (20%) had weak evidence, and 3 (15%) had no evidence found.

Model comparison

3.5.

Predictive power of the causal models was not dramatically different from that of the associational models, although the performance of each causal model exceeded that of its corresponding associational model by 0.04 to 0.05 AUC (p = 0.058 for the Random Forest models) (supplemental Fig. A6). The modest but consistent improvement suggests that the causal predictors may have been cleaner than associational predictors, and there may have been some borrowing of statistical strength from the Discovery Set.

Performance of the XGBoost causal model (0.789) was similar to the Random Forest causal model (0.788), and both outperformed the linear causal model (0.757), suggesting that there were nonlinear effects in the causal network.

The associational XGBoost model (0.755) outperformed the associational Random Forest model (0.738), but the difference disappeared in the causal models, suggesting that the learned signatures effectively captured essential predictive information.

Of the three causal models, the top predictive signatures of the Random Forest model make the most clinical sense to our subjective clinical judgement, which we attribute to model mismatch for the linear model, and to the greediness of the XGBoost model, which also had the greatest variability over random seeds. See supplemental material Section A3.1 and Figs. A2 – A4 for more on this point.

Discussion

We have demonstrated the use of probabilistic independence to disentangle clinical signatures of latent disease sources from institutional scale, structured EHR data using an unsupervised learning algorithm. We have also inferred the estimated expression of each source in a given patient record, evaluating the process by using the estimated expressions to classify malignant vs. benign pulmonary nodules and identify the sources driving the classification. Achieving this source identification is a step toward making patient-specific treatment decisions.

In our evaluation, the discovery model recovered a total of 102 signatures matching causes directly listed in the external reference, including 92% of the malignant causes and 30% of the benign causes (Table 2). Many of the unrecovered benign conditions are rare, and simply may not have been present in our Discovery Set. Furthermore, if a nodule is proven benign by biopsy or judged benign based on radiologic follow-up, the specific benign cause is typically not elucidated, and the elements of what would have become their signatures are left unobserved. Finally, some unrecovered signatures were for very detailed conditions, such that the granularity of the input data was not sufficient to represent them.

Because sources are arbitrarily scaled such that the distribution of their expressions over the Discovery Set has zero mean and unit standard deviation, some with long tails in both directions, they can appear as either a benign (negative SHAP value) or a malignant (positive SHAP value) cause, depending on the expression polarity of a given patient. SHAP value polarity usually matches expression polarity, but not always (all panels of Fig. 6 show mismatched polarity).

The polarity mismatch is usually due to the sign ambiguity of ICA inference[57,68] (our heuristic is to assign the sign such that the dominant element is a positive change), but the polarity of signatures containing malignancy codes (signatures 886, 1940, 1198, and 1452) are a special case that is at first glance counterintuitive (Fig. 6). All signatures were learned from the broad Discovery Set comprising patients with many kinds of lung disease, including cancer, and the patients with lung cancer in the Discovery Set have positive expressions for the lung cancer sources, as expected. But the Evaluation Set from which the supervised model was trained contained only patients with no history of any cancer. Therefore, no record in the sets from which the predictive models were trained or tested contained any cancer billing codes (the light-colored bars in Fig. 3). We might expect these records to have near zero expression for these sources, but that is not what happened. Instead, we determined experimentally that if a record perfectly matches one of these cancer signatures, except it has no cancer codes, it produces a negative expression (data not shown). This behavior depends upon the other signatures in $[eqn]$ , because the expressions $[eqn]$ , are a complex and dense mapping $[eqn]$ from $[eqn]$ to $[eqn]$ (while the mapping in the reverse direction is simple and sparse). In the case of these specific signatures in our matrix $[eqn]$ , records with a partial match to the signature, absent the major cancer billing code elements, express the source with a negative polarity.

The negative expressions of these unusual sources were identified by the supervised model as top predictors of a malignant nodule (Fig. 4) suggesting that these records may be indicating the presence of undiagnosed cancer in a patient. This can’t be an artifact of future information leaking into the past by treatment being started without cancer codes entered, because cancer treatment elements are not included as important members of the signatures. (There are many signatures that contain cancer treatment elements, but these have near zero expression in all Evaluation Set patients.).

To investigate whether these partial matches to cancer signatures actually represent patients with undiagnosed cancer, we overlay markers in Fig. 4 to indicate the records that contain future primary malignancies of upper lobe, lower lobe, and bladder, and find that they fall overwhelmingly into the appropriate regions (quantified by the probability curve below the scatter plot), providing evidence that the model had indeed detected specific undiagnosed cancer. Many other records without cancer also fall into the same region, so a positive SHAP value for a single signature has low PPV for cancer. But this is consistent with the magnitude of the values themselves, up to about a 0.02 probability increase.

Additionally, the behavior of the bladder cancer source is interesting in an unexpected way. The plot shows records containing future bladder cancer falling on both sides of zero for both axes, although they are more concentrated in the appropriate region. However, only records in the appropriate region contained a malignant pulmonary nodule. The small numbers prevent firm conclusions, but these results suggest that the signature may distinguish undiagnosed bladder cancer that has metastasized to the lung from undiagnosed bladder cancer that has not, depending on whether the partial match produces a negative or positive expression.

It is curious that non-respiratory primary cancers in the reference list other than bladder were not higher on the list of inferred causes, although melanoma and kidney cancer are also ranked fairly high. Perhaps this indicates something different about the behavior of bladder cancer compared to the others, such as a greater tendency to metastasize before diagnosis, whereas malignancies such as breast cancer are the focus of intense screening efforts and may be more often diagnosed before metastasis.

Limitations and future directions

4.1.

The discovery model produced an average decomposition of 5.5 signatures per listed cause when at least one match was found for the malignant nodules, and 4.1 per listed cause of benign nodules. A large fraction of the decompositions for common conditions like breast cancer and rheumatoid arthritis were by treatment approach or disease course, which is superficially exactly what we desired, but probably operationally insufficient. These decompositions were made from data recorded after treatment was decided and after the disease had run part of its course, but what we want is to distinguish mechanisms before that happens, to suggest optimal treatment. It may be that these decompositions can function the same way that our cancer signatures did for undiagnosed disease, where patients express a source in a salient way before the major, definitive elements are recorded. But it may also be that the signatures instead represent the effects of clinical processes, such as the different treatment preferences of different clinicians[97].

A striking result of the causal model is the multiplicity of sources that make causal contributions for a given record. This is most evident in the fact that the largest maximum causal effect over all test-set records, averaged over the 100 random seeds, was less than 0.03. Which means that in the best case, the expression of that source by a record increased the probability of malignancy by less than 0.03. Typically, tens of sources combine to produce the final prediction for a single record (data not shown). This seems reasonable for our selected clinical problem, given that human clinicians also have difficulty predicting the nature of an indeterminate pulmonary nodule using the patient’s clinical history. If there were just a few salient attributes that could predict malignancy with any reliability, we might expect clinicians to already be using them for diagnosis. The causal model’s average performance (AUC 0.788) is promising, but also demonstrates that there are nodule causes unknown to the model. These may be the etiologies in Table 2 that lack a matching signature, or they may be causes yet to be discovered.

Additionally, the absence of signatures corresponding to histological properties of lung nodules, such as non-small cell cancer or adenocarcinoma, implies that the signatures are built on insufficiently detailed observations. Extending the observed variables included in the Discovery Set to other data sources that capture these details may provide higher-resolution signatures.

Errors in source and signature inference may also have arisen from limitations inherent to ICA: linearity assumptions in source discovery, the possibility of converging to local minima, and allowing small residual dependency between sources. In addition, the dimensionality reduction from 9,195 original variables to 2,000 sources almost certainly affected some of the relevant signatures by combining important sources. Improving the implementation of ICA or Direct LiNGAM to scale up even further is an important direction for future work.

Our model is also subject to the standard assumptions of causal inference. For example, we assume no directed cycles in the causal graph, no unobserved single-cause confounders, and linear functions $[eqn]$ described in Section 1.2 for inferring the value of the latent sources from observations. While relaxation of these assumptions can be accommodated[45,46,95,] (and our experimental design tries to minimize the violations, such as by using nearly 10,000 variables for each record), the accommodations add complexity that we wanted to avoid here. They may be fruitful directions for future work.

Genetic variants, environmental factors, or social and behavioral determinants of health may be confounders in this analysis, and it may be useful to include them in future models. However, the risk of unobserved confounders is minimized by our design, which (approximately) eliminates multiple-source confounding among the inferred latent sources. The design retains the assumption of no unobserved single-cause confounding, which is a weaker assumption that there is no unobserved confounder that affects a single source $[eqn]$ and the label $[eqn]$ (supplemental Section A1.6.1).

Finally, while our models have produced clinically plausible sources for patients’ malignant or benign pulmonary nodules that align with the published literature on lung cancer and suggest the detection of undiagnosed disease, we have not established these sources as definitive. Caution is indicated by the fact that the three different model architectures produced different estimated causal effects for each, depending on the inductive bias of the model (Supplemental section A3.1 and Figs. A2–A4).

Conclusion

We have demonstrated the unsupervised learning of clinical signatures in routinely collected EHR data, using probabilistic independence as the guiding principle. This principle produces signatures corresponding to unobserved root nodes in the inferred causal graph of the observed data. Clinically, the signatures represent the imprint on the medical record of unobserved sources of disease, and a prediction model using the expression of those sources as input becomes a patient-specific causal model for the predicted target, which is a step in the direction of identifying patient-specific optimal treatment. In the absence of established definitive methods to evaluate disentangling or causal models, we evaluated the signatures by using them to predict the malignant or benign nature of pulmonary nodules in 13,252 patients, and then comparing the inferred patient-specific causes with known etiologies in the medical literature. The model left unrecovered some infrequently observed or narrowly detailed causes of benign nodules, but, surprisingly, it also identified apparently undiagnosed cancer in many patients.

Supplementary Material

Supplement 2

Supplement 3

Supplement Appendix A

Bibliography98

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Anderson GP, Endotyping asthma: new insights into key pathogenic mechanisms in a complex, heterogeneous disease, Lancet Lond Engl 372 (2008) 1107–1119, 10.1016/S 0140-6736(08)61452-X.18805339 · doi ↗ · pubmed ↗
2John MR, Goate A, Colin LM, Nigel JC, Danek A, Graff-Radford N, , Genetic heterogeneity in alzheimer disease and implications for treatment strategies, Curr Neurol Neurosci Rep 14 (2014), 10.1007/s 11910-014-0499-8.PMC 416298725217249 · doi ↗ · pubmed ↗
3Gutmann DH, Eliminating barriers to personalized medicine: Learning from neurofibromatosis type 1, Neurology 83 (2014) 1–9, 10.1212/WNL.0000000000000652.PMC 413256724975854 · doi ↗ · pubmed ↗
4Tuomi T, Santoro N, Caprio S, Cai M, Weng J, Groop L, The many faces of diabetes: a disease with increasing heterogeneity, Lancet 383 (2014) 1084–1094, 10.1016/S 0140-6736(13)62219-9.24315621 · doi ↗ · pubmed ↗
5Barnes PJ, Cellular and molecular mechanisms of asthma and COPD, Clin Sci 131 (2017) 1541–1558, 10.1042/CS 20160487.28659395 · doi ↗ · pubmed ↗
6Oksel C, Haider S, Fontanella S, Frainay C, Custovic A, Classification of Pediatric Asthma: From Phenotype Discovery to Clinical Practice, Front Pediatr 6 (2018).10.3389/fped.2018.00258 PMC 616073630298124 · doi ↗ · pubmed ↗
7Gul ZG, Kaplan SA, BPH: Why Do Patients Fail Medical Therapy? Curr Urol Rep 20 (2019) 40, 10.1007/s 11934-019-0899-z.31168725 · doi ↗ · pubmed ↗
8Schoettler N, Strek ME, Recent Advances in Severe Asthma: From Phenotypes to Personalized Medicine, Chest 157 (2020) 516–528, 10.1016/j.chest.2019.10.009.31678077 PMC 7609962 · doi ↗ · pubmed ↗