TL;DR
This paper introduces CATCH-FM, a large-scale healthcare foundation model trained on electronic health records to pre-screen for cancer risk, enabling early detection with high accuracy and robustness across diverse populations.
Contribution
The paper develops and pretrains large EHR foundation models of up to 2.4 billion parameters, demonstrating their effectiveness in cancer risk prediction from medical records.
Findings
Achieves 50% sensitivity at 99% specificity in predicting first cancer risks.
Outperforms feature-based models and general medical LLMs by up to 20% AUPRC.
Demonstrates robustness across diverse patient demographics and healthcare systems.
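The first two findings are stated in terms of sensitivity at a fixed specificity and AUPRC. A minimal sketch of how such numbers can be computed from risk scores follows. This is not the authors' evaluation code: the function names, the toy scores, and the use of average precision as a step-wise estimate of AUPRC are all assumptions made for illustration.

```python
# Illustrative only: hypothetical helpers, not the paper's evaluation code.

def sensitivity_at_specificity(scores, labels, target_specificity=0.99):
    """Return (sensitivity, threshold) for the rule 'flag if score > threshold',
    choosing the threshold so that specificity is at least the target."""
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not negatives or not positives:
        raise ValueError("need at least one positive and one negative example")
    # Number of negatives that may be flagged without dropping below the target.
    # The small epsilon guards against floating-point round-off in the product.
    allowed_fp = int((1.0 - target_specificity) * len(negatives) + 1e-9)
    if allowed_fp >= len(negatives):
        threshold = float("-inf")  # every patient may be flagged
    else:
        # At most `allowed_fp` negatives score strictly above this value.
        threshold = negatives[allowed_fp]
    true_positives = sum(1 for s in positives if s > threshold)
    return true_positives / len(positives), threshold


def average_precision(scores, labels):
    """Step-wise estimate of AUPRC: mean precision at the rank of each positive.
    Tied scores are ranked in input order, which is a simplification."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive example")
    return total / hits


# Toy data (made up): 100 controls with low scores, 4 cancer cases.
toy_scores = [i / 1000 for i in range(100)] + [0.5, 0.2, 0.0985, 0.05]
toy_labels = [0] * 100 + [1] * 4

sens, thr = sensitivity_at_specificity(toy_scores, toy_labels, 0.99)
print(f"sensitivity at 99% specificity: {sens:.2f} (threshold {thr})")
print(f"average precision: {average_precision(toy_scores, toy_labels):.3f}")
```

On this toy data one control may be flagged at 99% specificity, so the threshold lands on the second-highest control score and three of the four cases are caught. Fixing specificity first reflects the pre-screening setting described here, where the false-positive rate determines how many patients are referred for further screening.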
Abstract
Cancer screening, leading to early detection, saves lives. Unfortunately, existing screening techniques require expensive and intrusive medical procedures that are not globally available, resulting in too many lost would-be-saved lives. We present CATCH-FM, CATch Cancer early with Healthcare Foundation Models, a cancer pre-screening methodology that identifies high-risk patients for further screening solely based on their historical medical records. With millions of electronic healthcare records (EHR), we establish the scaling law of EHR foundation models pretrained on medical code sequences, pretrain compute-optimal foundation models of up to 2.4 billion parameters, and finetune them on clinician-curated cancer risk prediction cohorts. In our retrospective evaluation comprising thirty thousand patients, CATCH-FM achieves strong efficacy, with 50% sensitivity in predicting first cancer risks…
Peer Reviews
Decision·Submitted to ICLR 2026
- The paper conducts extensive experiments and ablation studies to show the scaling law of the foundation EHR model on multiple tasks and two datasets.
- The proposed model outperforms multiple baselines including ML, DL and LLM models.
- The methods are not substantially different from previous works such as CLMBR. Although the model is pretrained on a different dataset, the contribution is limited.
- The details of preprocessing the EHR dataset are not clearly stated (patient sample filtering and splitting). Some statistics of the dataset read counter-intuitive. (1) The prevalence of pancreatic cancer should be much less than lung and liver cancer (5-10 times less). However, Table 2 shows the positive rates are quite close among three
1. The model achieves strong and consistent performance across diverse datasets, demonstrating clear advantages over traditional and prior foundation model baselines.
2. The demonstration of scaling laws for EHR-based foundation models is impressive, providing valuable insights into how model size and data scale influence healthcare AI performance.
3. The proposed approach shows strong generalization across different populations and healthcare systems, highlighting its robustness and real-worl
1. Some experiment settings are questionable. For instance, Figure 3 and Table 5 compare Qwen-2.5-500m with CATCH-FM-2.4b, where notable parameter differences exist. A fairer comparison would use models with a comparable number of parameters. Moreover, it would be beneficial to include a larger-scale Qwen model in the experiments, as the 500m variant is not very common and may fail to serve as a competitive baseline.
2. Some experiment results are strange without any explanation. For exam
- AUROC and AUPRC are good evaluation metrics
- It's unclear what the contribution is. This is a reasonable case study of applying next token pretraining, but I'm unsure what new information this adds for the ICLR audience.
- Lack of code limits my ability to review, which is especially important when the most important experiments are on non-public datasets.
- Evaluation on only cancer is a weakness; it would be good to measure the methods on a variety of outcomes.
- Cumulative duration matching for controls is incorrect, as it leaks infor
