Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims

Fan Ma; Yuntian Liu; Xiang Lan; Weipeng Zhou; Jun Ni; Mauro Giuffrè; Lingfei Qian; Xueqing Peng; Yujia Zhou; Ruey-Ling Weng; Huan He; Lu Li; Huiyuan Wang; Qingyu Chen; Andrew Loza; Laila Rasmy; Degui Zhi; Yuan Lu; Chenjie Zeng; Joshua C Denny; Lee Schwamm; Daniella Meeker; Lucila Ohno-Machado; Yong Chen; and Hua Xu

arXiv:2605.02740·cs.AI·May 7, 2026

TL;DR

This paper introduces ReClaim, a large-scale generative transformer trained on medical claims data, demonstrating improved disease prediction, expenditure forecasting, and real-world evidence generation across diverse healthcare tasks.

Contribution

ReClaim is the first large-scale healthcare foundation model trained on nationwide claims data, outperforming existing models in disease prediction and real-world evidence applications.

Findings

01

ReClaim achieved a mean AUC of 75.6% on disease-onset prediction tasks.

02

Scaling the model improved performance monotonically and added significant gains over pre-training.

03

ReClaim enhanced healthcare expenditure forecasting and reduced bias in target trial emulation.

Abstract

Evidence derived from large-scale real-world data (RWD) is increasingly informing regulatory evaluation and healthcare decision-making. Administrative claims provide population-scale, longitudinal records of healthcare utilization, expenditure, and detailed coding of diagnoses, procedures, and medications, yet their potential as a substrate for healthcare foundation models remains largely unexplored. Here we present ReClaim, a generative transformer trained from scratch on 43.8 billion medical events from more than 200 million enrollees in the MarketScan claims data spanning 2008-2022. ReClaim models longitudinal trajectories across diagnoses, procedures, medications, and expenditure, and was scaled to 140 million, 700 million, and 1.7 billion parameters. Across over 1,000 disease-onset prediction tasks, ReClaim achieved a mean AUC of 75.6%, substantially outperforming disease-specific…
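The headline metric above is a mean AUC taken over many disease-onset prediction tasks. As a minimal sketch of how such a macro-average can be computed, assuming each task supplies binary onset labels and model risk scores: the task names, toy data, and helper functions `auc` and `mean_auc` below are hypothetical illustrations, not the authors' evaluation code, and the abstract does not say whether tasks were weighted equally.

```python
from statistics import mean


def auc(labels, scores):
    """AUC as the probability that a positive case outranks a negative one.

    Equivalent to the normalized Mann-Whitney U statistic; ties count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc(tasks):
    """Unweighted (macro) average of per-task AUCs; equal weighting is an assumption."""
    return mean(auc(labels, scores) for labels, scores in tasks.values())


# Hypothetical toy data: two tasks, each a (labels, scores) pair.
toy_tasks = {
    "task_a": ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    "task_b": ([0, 1, 0, 1], [0.2, 0.9, 0.3, 0.7]),
}

per_task = {name: auc(y, s) for name, (y, s) in toy_tasks.items()}
macro = mean_auc(toy_tasks)
print(per_task, macro)
```

The pairwise comparison is quadratic in the number of examples, which is acceptable for illustration; at claims scale a rank-based implementation would be the practical choice.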
