Measuring Model Performance in the Presence of an Intervention
Winston Chen, Michael W. Sjoding, Jenna Wiens

TL;DR
This paper introduces a new evaluation method for AI models in intervention settings that uses all available data from RCTs, reducing bias and improving model selection accuracy.
Contribution
It develops a theoretically grounded, unbiased evaluation approach called nuisance parameter weighting (NPW) that leverages data from both treatment and control groups.
Findings
NPW reduces bias in model evaluation.
NPW outperforms standard methods in synthetic and real datasets.
NPW improves model selection accuracy in intervention contexts.
Abstract
AI models are often evaluated based on their ability to predict the outcome of interest. However, in many AI for social impact applications, the presence of an intervention that affects the outcome can bias the evaluation. Randomized controlled trials (RCTs) randomly assign interventions, allowing data from the control group to be used for unbiased model evaluation. However, this approach is inefficient because it ignores data from the treatment group. Given the complexity and cost often associated with RCTs, making the most use of the data is essential. Thus, we investigate model evaluation strategies that leverage all data from an RCT. First, we theoretically quantify the estimation bias that arises from naively aggregating performance estimates from treatment and control groups and derive the condition under which this bias leads to incorrect model selection. Leveraging these…
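The contrast the abstract draws between control-only evaluation and naive aggregation can be illustrated with a small simulation. The sketch below is not the paper's NPW method, which this summary does not specify. It assumes a hypothetical RCT in which the intervention prevents each event with probability 0.5, scores a hypothetical model with the Brier score, and compares both estimates with an oracle value that is only computable in simulation. All names (`simulate_rct`, `brier`, the effect size) are illustrative assumptions.

```python
import random


def brier(scores, outcomes):
    """Mean squared error between predicted risk and binary outcome."""
    return sum((s - y) ** 2 for s, y in zip(scores, outcomes)) / len(scores)


def simulate_rct(n=50_000, seed=0):
    """Simulate a hypothetical RCT; returns (score, treated, observed_y, untreated_y) rows."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.random()                      # model score = true untreated risk
        y0 = 1 if rng.random() < x else 0     # outcome without the intervention
        # Assumed effect: the intervention prevents each event with probability 0.5.
        y1 = y0 if rng.random() < 0.5 else 0
        t = 1 if rng.random() < 0.5 else 0    # randomized assignment
        y = y1 if t else y0                   # only one outcome is observed
        rows.append((x, t, y, y0))
    return rows


rows = simulate_rct()

# Oracle: performance against untreated outcomes for everyone
# (unobservable in a real trial, available here because we simulate).
oracle = brier([r[0] for r in rows], [r[3] for r in rows])

# Control-only estimate: uses only the untreated arm.
control = [r for r in rows if r[1] == 0]
control_only = brier([r[0] for r in control], [r[2] for r in control])

# Naive pooled estimate: treats observed outcomes in both arms alike,
# which equals a sample-size-weighted average of the per-arm estimates.
naive_pooled = brier([r[0] for r in rows], [r[2] for r in rows])

print(f"oracle       = {oracle:.4f}")
print(f"control only = {control_only:.4f}")
print(f"naive pooled = {naive_pooled:.4f}")
```

Under these assumptions the control-only estimate tracks the oracle closely but discards about half the sample, while the pooled estimate is shifted because the intervention changes outcomes in the treatment arm. That gap is the kind of bias the abstract describes.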
Taxonomy
Topics: Advanced Causal Inference Techniques · Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education
