Phenotype Inference with Semi-Supervised Mixed Membership Models
Victor Rodriguez, Adler Perotte

TL;DR
This paper introduces SS3M, a semi-supervised probabilistic model that efficiently learns disease phenotypes from clinical data with minimal labeled examples, improving interpretability and disease specificity.
Contribution
The paper presents SS3M, a novel semi-supervised mixed membership model that addresses the limitations of existing phenotyping methods by requiring fewer labels and producing interpretable disease phenotypes.
Findings
SS3M effectively learns disease-specific phenotypes from limited labeled data.
The model produces interpretable phenotypes aligned with clinical characteristics.
SS3M outperforms purely supervised or unsupervised methods in phenotype quality.
Abstract
Disease phenotyping algorithms process observational clinical data to identify patients with specific diseases. Supervised phenotyping methods require significant quantities of expert-labeled data, while unsupervised methods may learn non-disease phenotypes. To address these limitations, we propose the Semi-Supervised Mixed Membership Model (SS3M) -- a probabilistic graphical model for learning disease phenotypes from clinical data with relatively few labels. We show SS3M can learn interpretable, disease-specific phenotypes which capture the clinical characteristics of the diseases specified by the labels provided.
Click any figure to enlarge with its caption.
Figure 1
Figure 2| Coherence | Granularity | True label matches phen.? | True label matches expert’s? | ||||||||||||
| 1 | 2 | 3 | 4 | 5 | 1 | 2 | 3 | 1 | 2 | 3 | 4 | 5 | Y | N | |
| SS3M | 20% | 8% | 14% | 36% | 22% | 24% | 32% | 44% | 36% | 16% | 12% | 22% | 14% | 26% | 74% |
| MC3M | 18% | 28% | 32% | 20% | 2% | 42% | 42% | 16% | – | – | – | – | – | – | – |
| SS3M | MC3M-SP | MC3M | Raw Tokens | ||||||||
| smpl | smpl | fix | fix | LR | NB | LR | NB | LR | NB | ||
| smpl | fix | smpl | fix | ||||||||
| AUROC | micro | 0.639 | 0.653 | 0.719 | 0.720 | 0.865 | 0.805 | 0.837 | 0.826 | 0.884 | 0.803 |
| macro | 0.595 | 0.647 | 0.695 | 0.714 | 0.810 | 0.763 | 0.779 | 0.778 | 0.840 | 0.780 | |
| AUPRC | micro | 0.187 | 0.156 | 0.299 | 0.249 | 0.427 | 0.283 | 0.353 | 0.308 | 0.527 | 0.309 |
| macro | 0.162 | 0.187 | 0.252 | 0.245 | 0.277 | 0.215 | 0.232 | 0.229 | 0.426 | 0.236 | |
| Log-likelihood | – | ||||||||||
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMachine Learning in Healthcare · Biomedical Text Mining and Ontologies · Artificial Intelligence in Healthcare
Phenotype Inference with Semi-Supervised Mixed Membership Models
Victor A. Rodriguez
Department of Biomedical Informatics
Columbia University
New York, NY 10025
&Adler Perotte
Department of Biomedical Informatics
Columbia University
New York, NY 10025
Abstract
Disease phenotyping algorithms process observational clinical data to identify patients with specific diseases. Supervised phenotyping methods require significant quantities of expert-labeled data, while unsupervised methods may learn non-disease phenotypes. To address these limitations, we propose the Semi-Supervised Mixed Membership Model (SS3M) – a probabilistic graphical model for learning disease phenotypes from clinical data with relatively few labels. We show SS3M can learn interpretable, disease-specific phenotypes which capture the clinical characteristics of the diseases specified by the labels provided.
1 Introduction
Phenotypes are powerful tools for working with observational clinical data in the absence of reliable disease labels [1]. Disease-specific phentoypes allow researchers to sift through large-scale clinical data stores to identify patients with evidence of specific clinical conditions. By answering the question of who has what disease, phenotypes power essential tasks such as cohort selection, trial recruitment and clinical outcome prediction [1, 2, 3, 4].
Traditionally, phenotypes were developed by groups of clinical experts who painstakingly hand-tuned rule-based algorithms. The limited scalability of this approach has led to the development of automated methods for learning phenotypes directly from clinical data. Many studies in this vein utilize supervised machine learning methods to build phenotyping algorithms [5, 6]. Though this approach avoids laborious expert knowledge engineering, it requires significant amounts of labeled clinical data generated by manual chart review.
To avoid costly, expert-generated disease labels, many authors have utilized unsupervised methods to cluster patients according to underlying patterns in their clincal data [7, 8, 9, 10, 11]. In this setting, such patterns play the role of phenotypes. Unsupervised phenotyping methods often learn multiple phenotypes simultaneouly, which may confer evidence of specific diseases. However, such phenotypes generally are not guaranteed to represent single disease concepts. This complicates their evaluation and use in downstream tasks.
In this paper we propose the Semi-Supervised Mixed Membership Model (SS3M), a probabilistic graphical model which utilizes relatively few disease labels to learn multiple disease-specific phenotypes from observational clinical data. SS3M addresses the limitations of supervised phenotyping by reducing the amount of labeled data needed to learn disease phenotypes; patients do not need to have labels for all diseases. SS3M also addresses the limitations of unsupervised phenotyping by associating disease labels with the phenotypes to be learned; a label specifies which disease a phenotype is meant to represent. This simplifies the evaluation of SS3M phenotypes.
Our overall goal is to utilize SS3M in an active learning framework, where we may request small batches of labels from clinical experts as needed to learn a phenotype for a given disease. In the present work, we aim to evaluate SS3M’s capacity for learning disease-specific phenotypes by utilizing readily available ICD9 diagnosis codes as our labels. We consider this phase in our work a crucial step towards obtaining a reliable semi-supervised model.
2 Semi-Supervised Mixed Memberships Models
SS3M is closely related to the popular topic model Latent Dirichlet Allocation (LDA) [12], and its multi-modal [13] and supervised [14] extensions. We detail the generative process for SS3M and provide a graphical model in the appendix.
Inference
We implement a Gibbs sampler to sample latent variables from SS3M’s posterior distribution. Conjugacy gives us the complete conditional distributions for the patient-phenotype distributions, , and phenotype-token distributions, , in closed form. The complete conditionals for phenotype assignments, , and phenotype activations, , are easily normalized. We use Hamiltonian Monte Carlo to sample from the complete conditionals of and [15]. Setting the path length and step size yielded stable trajectories with high acceptance rates in preliminary experiments.
3 Experiments
3.1 Dataset
We train all our models using clinical data extracted from the Medical Information Mart for Intensive Care version III (MIMIC-III) [16]. Our dataset is restricted to the first hospital admission of 46,520 neonatal and adult patients. Clinical notes, labs, and medications comprise corpora of clinical observations. The 50 most common ICD9 diagnosis codes form our label set. Tokenized clinical observation corpora were preprocessed to remove stopwords as well as low- and high-frequency tokens. The final preprocessed corpora and labels were then split into training and test sets containing data for 80% and 20% of patients respectively.
3.2 SS3M
We train SS3M to learn phenotypes; 50 labeled phenotypes (one for each ICD9 diagnosis code label) and 20 unlabeled phenotypes. In preliminary experiments, we found the addition of unlabeled phenotypes improved the overall interpretability of their labeled counterparts.
For a given patient, the model may treat the absence of a label in two ways. In the first, a missing label indicates that the label’s activation, , should be set to 0. In the second, an absent label indicates the corresponding activation should be estimated. This latter case is meant to model the uncertainty associated with assignment of diagnosis codes in clinical settings. We train seperate models to explore both cases.
During inference, we have the option to sample the and or hold them fixed. Preliminary experiments suggest holding these variables fixed results in more interpretable labeled phenotypes. We explore both cases initializing the and with draws from Gamma distributions with shape and scale parameters and respectively.
In all experiments we set and , and sampled latent variables for 200 iterations.
3.3 Quantitative evaluation
In all cases involving mixed membership models, global latent variables are learned on the training set. Trained global variables are then loaded into models exposed to patient data in the test set.
Label prediction
For SS3M, we estimate each test patient’s posterior likelihood for each disease label. We evaluate against logistic regression (LR) and naive Bayes (NB) models trained with raw tokens or patient-phenotype distributions learned with either of two mixed membership models. The first of these is the multi-channel mixed membership model (MC3M) – an unsupervised model similar in structure to SS3M but with a simple Dirichlet prior on the patient-phenotype distributions, [13]. The second is SS3M trained without labels. This model is effectively MC3M with a structured prior (MC3M-SP) on the .
Log-likelihood
We report the maximum complete data log-likelihoods observed during training of each of our three mixed membership models: SS3M, MC3M, and MC3M-SP.
3.4 Qualitative evaluation
Here we leverage the judgement of a clinical expert to asses the quality of SS3M phenotypes relative to MC3M phenotypes. In particluar, we aim to evaluate the coherence, granularity, and label quality of learned phenotypes. We asked our clinical expert to complete the following tasks, which were inspired by the expert evaluations detailed in Pivovarov et al. [13]
Coherence
A coherent phenotype was defined as a phenotype containing observations typical of a single disease while omitting observations atypical of said disease. The clinical expert was asked to rate the coherence of individual phenotypes using a five-point Likert scale, with 1 and 5 signifying low and and high coherence respectively.
Granularity
We defined three categories of phenotype granularity: (1) non-disease, (2) mixture of diseases, (3) single disease. We asked our expert to assign each phenotype to one of these categories.
Label quality
We asked our clinical expert to generate a label for each phenotype. If no such label came to mind, the expert was asked to omit this step. If the phenotype in question was learned using SS3M, the expert was asked if their label was equivalent to the phenotype’s true label. In addition, the expert was asked to specify how well the true label matched its learned phenotype using a five-point Likert scale with 1 indicating no match and 5 a perfect match.
The phenotypes for our qualitative evaluations were learned using SS3M and MC3M models with . The 50 labeled SS3M phenotypes and 50 randomly chosen MC3M phenotypes were shuffled together. During the evaluation, a single phenotype was drawn and shown to our expert who then completed the tasks above. This was repeated until all 100 phenotypes had been evaluated.
4 Results
Our clinical expert judged SS3M phenotypes to have higher coherence and granularity relative to MC3M phenotypes. Though less than a quarter of the expert’s labels were found to match the true labels for SS3M phenotypes, nearly half (48%) of these phenotypes were found to be good matches for their true labels (Likert score ). A summary of the results from our qualitative evaluation is presented in Table 1. A sample of phenotypes judged to match both expert and true labels are shown in Figure 1.
Nearly all multilabel classification baselines displayed superior performance relative to SS3M in micro and macro averaged AUROC and AUPRC. Overall, logistic regression trained on raw tokens displayed the best performance. Notably, among SS3M models, the simplest model holding missing label activations fixed (fix ) and fixing and (fix ) was best in micro and macro averaged AUROC. Meanwhile, the SS3M model with best micro and macro averaged AUPRC had fixed missing activations and sampled and (smpl ).
The overall best log-likelihood was obtained with our most flexible mixed membership model: SS3M with sampled activations and sampled and . A summary of the results from our quantitative evaluation is presented in Table 2
5 Discussion
SS3M is a model constructed for learning disease phenotypes from observational clinical data in a semi-supervised or active learning setting. As a preliminary step toward this goal, we have exposed the model to MIMIC-III clinical data and derived phenotype labels from common ICD9 diagnosis codes. In quantitative evaluations, SS3M was generally outperformed by baselines in predicting disease labels on held out patient data. Meanwhile, in qualitative evaluations conducted by a clinical expert, SS3M phenotypes were judged to be more coherent, and more granular than phenotypes learned with a related unsupervised model. Moreover, SS3M phenotypes were often found to capture the clinical characteristics of the clinical entities specified by their associated labels. These results suggest that despite its weak performance on label prediction, SS3M does indeed appear capable of learning interpretable, disease-specific phenotypes from clinical data.
SS3M’s weak predictive performance warrants some scrutiny. Suboptimal setting of the model’s fixed parameters may be to blame. Indeed, in this work we did little to tune the model’s hyperparameters or the total number of phenotypes to be learned. Exploring various settings of these fixed parameters is likely to produce a model with improved predictive performance on held-out test data. It is also possible that our choice of labels is hindering SS3M’s performance. ICD9 codes are notoriously noisy proxies for ground-truth diagnosis. This leaves open the possibility that SS3M may actually be generating correct labels for held-out patients who simply lack the appropriate labels for their true underlying clinical conditions. We may better explore this possibility by incorporating SS3M into an active learning framework where high-fidelity labels may be requested as needed from a clinician.
6 Conclusion
SS3M learns disease-specific phenotypes from labeled clinical data. Future work will explore alternative parameterizations to optimize the model’s performance on multi-label prediction and utilize higher fidelity label sets. The positive qualitative assessment of SS3M phenotypes encourages us to continue developing SS3M for use in truly semi-supervised or active learning settings.
Appendix
The generative process for SS3M is detailed below. The corresponding graphical models appears in Figure 2
For each phenotype and data source draw a phenotype-token distribution . 2. 2.
For each phenotype draw . 3. 3.
Draw . 4. 4.
For each patient :
- (a)
For each phenotype draw phenotype activations 2. (b)
Draw a patient-phenotype distribution 3. (c)
For each data source and observation
- i.
Draw a phenotype assignment 2. ii.
Draw an observed token
In steps 2, 3 and 4(a) we introduce patient-level binary phenotype activations and global latent variables and which interact to parameterize the Dirichlet prior on each in step 4(b). For patient , if activation then the parameter to the Dirichlet prior on is set to ; otherwise, it is set to .
Activations may be estimated, or they may be held fixed in accordance with a set of binary labels reflecting the presence (1) or absence (0) of specific diseases. In this latter case, the presence of a disease label corresponding to phenotype for patient would set during inference. This constrains the model to use in the prior to instead of . The goal of introducing this structure is to encourage the model to learn values for the which push the to place most of their mass over phenotypes with positive activations. Thus, by linking activations to labels, a patient’s data is driven toward informing inference of phenotypes that capture the distinct clinical meanings associated their labels.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] Hripcsak, George, and David J. Albers. "Next-generation phenotyping of electronic health records." Journal of the American Medical Informatics Association 20.1 (2012): 117-121.
- 2[2] Richesson, Rachel L., et al. "Electronic health records based phenotyping in next-generation clinical trials: a perspective from the NIH Health Care Systems Collaboratory." Journal of the American Medical Informatics Association 20.e 2 (2013): e 226-e 231.
- 3[3] Richesson, Rachel L., et al. "Clinical phenotyping in selected national networks: demonstrating the need for high-throughput, portable, and computational methods." Artificial intelligence in medicine 71 (2016): 57-61.
- 4[4] Pathak, Jyotishman, Abel N. Kho, and Joshua C. Denny. "Electronic health records-driven phenotyping: challenges, recent advances, and perspectives." (2013): e 206-e 211.
- 5[5] Bergquist, Savannah L., et al. "Classifying lung cancer severity with ensemble machine learning in health care claims data." Machine Learning for Healthcare Conference. 2017.
- 6[6] Esteban, Santiago, et al. "Development and validation of various phenotyping algorithms for Diabetes Mellitus using data from electronic health records." Computer methods and programs in biomedicine 152 (2017): 53-70.
- 7[7] Joshi, Shalmali, et al. "Identifiable phenotyping using constrained Non-Negative matrix factorization." ar Xiv preprint ar Xiv:1608.00704 (2016).
- 8[8] Ho, Joyce C., et al. "Limestone: High-throughput candidate phenotype generation via tensor factorization." Journal of biomedical informatics 52 (2014): 199-211.
