Domain constraints improve risk prediction when outcome data is missing

Sidhika Balachandar; Nikhil Garg; Emma Pierson

arXiv:2312.03878 · cs.LG · April 23, 2024 · 2 cites

TL;DR

This paper introduces a Bayesian model with domain constraints to improve risk prediction accuracy for both tested and untested patients when outcome data is missing, especially in health-related decision-making contexts.

Contribution

It proposes two novel domain constraints—prevalence and expertise constraints—that enhance parameter inference in models with missing outcome data.

Findings

01 · Domain constraints improve model inference accuracy.

02 · The model effectively predicts cancer risk and captures testing policies.

03 · It can identify suboptimal test allocation strategies.

Abstract

Machine learning models are often trained to predict the outcome resulting from a human decision. For example, if a doctor decides to test a patient for disease, will the patient test positive? A challenge is that historical decision-making determines whether the outcome is observed: we only observe test outcomes for patients doctors historically tested. Untested patients, for whom outcomes are unobserved, may differ from tested patients along observed and unobserved dimensions. We propose a Bayesian model class which captures this setting. The purpose of the model is to accurately estimate risk for both tested and untested patients. Estimating this model is challenging due to the wide range of possibilities for untested patients. To address this, we propose two domain constraints which are plausible in health settings: a prevalence constraint, where the overall disease prevalence is…
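The selective-labels setting described in the abstract can be illustrated with a small simulation. This is a minimal sketch, not the authors' model: the Bernoulli-sigmoid outcome and the normally distributed unobservable follow what the reviews mention, while the function name `simulate`, the coefficients, the intercept, and the testing rule are all hypothetical choices made for illustration.

```python
import numpy as np


def sigmoid(x):
    """Logistic function mapping real-valued risk to a probability."""
    return 1.0 / (1.0 + np.exp(-x))


def simulate(n=50_000, seed=0, sigma_z=1.0, alpha=2.0):
    """Simulate outcomes that are only observed for tested patients.

    All numeric values here are hypothetical, chosen for illustration.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, 2))             # observed features
    z = rng.normal(scale=sigma_z, size=n)   # unobserved risk factor (assumed normal)
    beta_y = np.array([1.0, 0.5])           # hypothetical risk coefficients
    risk = x @ beta_y + z - 2.0             # latent risk with a hypothetical intercept

    y = rng.random(n) < sigmoid(risk)           # true disease status (Bernoulli-sigmoid)
    t = rng.random(n) < sigmoid(alpha * risk)   # testing depends on risk, including z

    # The outcome is visible only where a test was performed.
    y_obs = np.where(t, y.astype(float), np.nan)
    return x, y, t, y_obs


x, y, t, y_obs = simulate()
print("fraction tested:          ", t.mean())
print("positivity among tested:  ", y[t].mean())
print("true overall prevalence:  ", y.mean())
```

Because testing depends on the full risk, including the unobserved component, positivity among tested patients overstates population prevalence in this simulation. A known overall prevalence is the kind of outside information that could help close that gap.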

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 5 · marginally below the acceptance threshold · Confidence 3

Strengths

1. The paper is very well written; readers can easily follow the motivation, problem formulation, and experimental design.
2. I appreciate the experiments on a real breast cancer dataset. A setting where outcomes for untested patients are missing is very difficult.
3. The paper addresses a significant problem where the outcomes of the patients that are tested are missing and there is distributional shift between tested and untested pat

Weaknesses

1. I think the paper has limited novelty. The linear risk setting has been considered before, as cited in the paper [(Hicks, 2021)]. This paper adds two constraints: a prevalence constraint and an expertise constraint. The expertise constraint sets one of the variables to 0, which could easily be addressed by dropping that feature from the dataset; the prevalence constraint sets the expectation of the outcome, which could be addressed by normalizing the feature space and adding a bias term. I

Reviewer 02 · Rating 8 · accept, good paper · Confidence 2

Strengths

- The paper is well motivated.
- The constraints introduced are logical and reasonable.
- Both theoretical and empirical analysis show improved performance.

Weaknesses

- The chosen Bernoulli-sigmoid model may be overly simplistic. Especially in the healthcare field, the intricate relationship between features and labels might not be fully represented by this basic model.
- The empirical tests were limited to only 7 features, raising questions about the model's scalability with a larger feature set.
- Section 5.2's results are somewhat ambiguous. For instance, in the subsection "Inferred risk predicts breast cancer diagnoses," it would be beneficial to include

Reviewer 03 · Rating 8 · accept, good paper · Confidence 4

Strengths

1. The proposed hierarchical Bayesian model, which addresses the unobservables and connects them with the actual observations to evaluate the risk score and test decision, makes sense and is novel.
2. The prevalence constraint and expertise constraint used to shrink the estimation uncertainty are novel. In practice, the two constraints are usually easy to obtain, which makes them practically useful.
3. The authors demonstrated in synthetic data that the constraints proposed can effectively reduce

Weaknesses

1. The actual Bayesian model derived from Proposition 3.1 seems too simple in practice. The assumption that the unobservable always comes from an independent normal distribution can be too strong.
2. When applying the model to UK Biobank, filtering out individuals whose age is below 45 is not convincing.

Videos

Domain constraints improve risk prediction when outcome data is missing · slideslive

Taxonomy

Topics: Machine Learning in Healthcare · Topic Modeling · Colorectal Cancer Screening and Detection