Model-assisted cohort selection with bias analysis for generating large-scale cohorts from the EHR for oncology research
Benjamin Birnbaum, Nathan Nussbaum, Katharina Seidl-Rathkopf, Monica Agrawal, Melissa Estevez, Evan Estola, Joshua Haimson, Lucy He, Peter Larson, Paul Richardson

TL;DR
This paper presents MACS, a machine learning method that improves cohort selection from EHR data for oncology research while assessing and minimizing bias, demonstrated on metastatic breast cancer patients.
Contribution
The study introduces MACS with bias analysis, a novel approach to enhance cohort selection efficiency and evaluate bias in EHR-based oncology research.
Findings
High model performance with AUC of 0.976
77.9% efficiency gain in abstraction process
No significant bias detected in cohort comparison
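To make the AUC figure above concrete, here is a minimal sketch of how such a value can be computed, assuming the reported AUC is the area under the ROC curve. It uses the rank interpretation: the AUC equals the probability that a randomly chosen positive patient receives a higher model score than a randomly chosen negative one. The function name `roc_auc` and the toy scores are illustrative and are not taken from the paper.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Counts, over every (positive, negative) pair, how often the positive
    example has the higher score; ties count as half a win.
    Hypothetical helper for illustration, not the authors' code.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (made up): three positives, two negatives.
toy_scores = [0.9, 0.8, 0.3, 0.6, 0.2]
toy_labels = [1, 1, 1, 0, 0]
toy_auc = roc_auc(toy_scores, toy_labels)  # 5 of 6 pairs ranked correctly
```

The pairwise form is quadratic in the number of patients, so for a test set of the size described in the abstract a sort-based implementation would be the practical choice; the quadratic version is used here only because it mirrors the definition directly.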
Abstract
Objective: Electronic health records (EHRs) are a promising source of data for health outcomes research in oncology. A challenge in using EHR data is that selecting cohorts of patients often requires information in unstructured parts of the record. Machine learning has been used to address this, but even high-performing algorithms may select patients in a non-random manner and bias the resulting cohort. To improve the efficiency of cohort selection while measuring potential bias, we introduce a technique called Model-Assisted Cohort Selection (MACS) with Bias Analysis and apply it to the selection of metastatic breast cancer (mBC) patients.
Materials and Methods: We trained a model on 17,263 patients using term-frequency inverse-document-frequency (TF-IDF) and logistic regression. We used a test set of 17,292 patients to measure algorithm performance and perform Bias Analysis. We compared…
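The abstract names TF-IDF features and logistic regression as the modeling approach. The sketch below shows that general combination in plain Python on a handful of made-up note snippets. It is an illustration under stated assumptions, not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, the learning rate, the regularization strength, and all function names are choices made here for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization.
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency for each training term."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(doc, idf):
    """L2-normalized TF-IDF vector as a sparse dict; unseen terms are dropped."""
    counts = Counter(t for t in tokenize(doc) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}


def predict(vec, weights, bias):
    """Probability of the positive class under the logistic model."""
    z = sum(weights.get(t, 0.0) * v for t, v in vec.items()) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(vectors, labels, epochs=500, lr=0.5, l2=0.01):
    """Batch gradient descent on L2-regularized logistic loss."""
    weights, bias = {}, 0.0
    n = len(vectors)
    for _ in range(epochs):
        grad_w, grad_b = Counter(), 0.0
        for vec, y in zip(vectors, labels):
            err = predict(vec, weights, bias) - y
            grad_b += err / n
            for t, v in vec.items():
                grad_w[t] += err * v / n
        for t in set(grad_w) | set(weights):
            w = weights.get(t, 0.0)
            weights[t] = w - lr * (grad_w[t] + l2 * w)
        bias -= lr * grad_b
    return weights, bias


# Made-up snippets standing in for unstructured EHR text (1 = mBC, 0 = not).
train_docs = [
    "metastatic breast cancer with bone metastases",
    "stage iv breast cancer metastatic to liver",
    "breast cancer with distant metastases to lung",
    "early stage breast cancer no evidence of spread",
    "localized breast cancer treated with lumpectomy",
    "routine mammogram benign findings",
]
train_labels = [1, 1, 1, 0, 0, 0]

idf = fit_idf(train_docs)
weights, bias = train_logreg([tfidf(d, idf) for d in train_docs], train_labels)

score_mbc = predict(tfidf("metastatic breast cancer bone lesions", idf), weights, bias)
score_other = predict(tfidf("benign mammogram routine follow up", idf), weights, bias)
```

In a model-assisted workflow, scores like `score_mbc` would be used to decide which patients are sent on for manual review, which is where an efficiency gain could arise; how the paper sets that cutoff is not stated in the text available here.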
Taxonomy
Topics: Machine Learning in Healthcare · Biomedical Text Mining and Ontologies · Colorectal Cancer Screening and Detection
Methods: Test
