A Semi-Supervised Machine Learning Approach to Detecting Recurrent Metastatic Breast Cancer Cases Using Linked Cancer Registry and Electronic Medical Record Data
Albee Y. Ling, Allison W. Kurian, Jennifer L. Caswell-Jin, George W. Sledge Jr., Nigam H. Shah, Suzanne R. Tamang

TL;DR
This study presents a semi-supervised machine learning framework that effectively detects recurrent metastatic breast cancer cases by integrating linked electronic medical records and cancer registry data, enhancing population-based cancer research.
Contribution
The paper introduces a novel semi-supervised approach combining EMR and registry data for accurate MBC detection without expert-labeled training data.
Findings
The model achieved an AUC of 0.925 in detecting MBC.
It classified cases with a sensitivity of 0.861 and a specificity of 0.878.
The framework enables large-scale population research on cancer recurrence.
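For readers less familiar with the three metrics reported above, the following is a minimal sketch of how sensitivity, specificity, and AUC are computed from true labels and model scores. The toy labels, scores, and the 0.5 threshold are illustrative assumptions; they are not the paper's data or its chosen operating point.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney formulation).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (made-up values): 3 metastatic cases, 4 non-metastatic cases.
y_true = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.05]

threshold = 0.5  # assumed cutoff for turning scores into class predictions
y_pred = [1 if s >= threshold else 0 for s in scores]

sens, spec = sensitivity_specificity(y_true, y_pred)
area = auc(y_true, scores)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} AUC={area:.3f}")
```

Sensitivity and specificity depend on the threshold, while AUC summarizes ranking quality across all thresholds, which is why the three figures are reported together.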
Abstract
Objectives: Most cancer data sources lack information on metastatic recurrence. Electronic medical records (EMRs) and population-based cancer registries contain complementary information on cancer treatment and outcomes, yet are rarely used synergistically. To enable detection of metastatic breast cancer (MBC), we applied a semi-supervised machine learning framework to linked EMR-California Cancer Registry (CCR) data. Materials and Methods: We studied 11,459 female patients treated at Stanford Health Care who received an incident breast cancer diagnosis from 2000-2014. The dataset consisted of structured data and unstructured free-text clinical notes from EMR, linked to CCR, a component of the Surveillance, Epidemiology and End Results (SEER) database. We extracted information on metastatic disease from patient notes to infer a class label and then trained a regularized logistic…
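The two-step idea in the abstract (infer a class label from note text, then train a regularized logistic model on it) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the regular expressions, the two numeric features, the L2 penalty, and all hyperparameters are hypothetical, since the summary does not specify the extraction rules, the feature set, or the form of regularization.

```python
import math
import re

# Hypothetical patterns; the paper's actual note-extraction rules are not given here.
METASTATIC_PATTERN = re.compile(r"\b(metasta\w+|stage iv)\b", re.IGNORECASE)
NEGATION_PATTERN = re.compile(
    r"\b(no|without|negative for)\b[^.]*\bmetasta\w+", re.IGNORECASE
)


def infer_label(notes):
    """Return 1 if any sentence has a non-negated metastatic mention, else 0."""
    for note in notes:
        for sentence in note.split("."):
            if METASTATIC_PATTERN.search(sentence) and not NEGATION_PATTERN.search(sentence):
                return 1
    return 0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_l2(X, y, l2=0.1, lr=0.5, epochs=500):
    """Batch gradient descent for logistic regression with an L2 penalty on the weights.

    The L2 penalty is an assumption; the intercept is left unpenalized.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * (gw / n + l2 * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Toy patients: made-up notes plus two made-up structured features scaled to [0, 1].
patients = [
    {"notes": ["New metastatic lesions in liver."], "features": [1.0, 0.9]},
    {"notes": ["No evidence of metastatic disease."], "features": [0.0, 0.1]},
    {"notes": ["Bone scan shows metastases."], "features": [1.0, 0.7]},
    {"notes": ["Routine follow-up, doing well."], "features": [0.0, 0.2]},
]

# Step 1: labels are inferred from text, not assigned by an expert.
labels = [infer_label(p["notes"]) for p in patients]

# Step 2: train the classifier on the inferred labels.
X = [p["features"] for p in patients]
w, b = train_logistic_l2(X, labels)
model_scores = [predict_proba(w, b, x) for x in X]
print(labels, [round(s, 2) for s in model_scores])
```

The design point is that the labels come from rules applied to the notes, so no manually annotated training set is needed; the trade-off is that label noise from the rules carries into the classifier.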
Taxonomy
Methods: Logistic Regression
