Developing Large Language Models for Clinical Research Using One Million Clinical Trials
Zifeng Wang, Jiacheng Lin, Qiao Jin, Junyi Gao, Jathurshan Pradeepkumar, Pengcheng Jiang, Zhiyong Lu, Jimeng Sun

TL;DR
This paper introduces TrialPanorama, a large-scale structured dataset of 1.6 million clinical trials linked with biomedical ontologies, enabling the development of specialized large language models that outperform generic models in clinical research tasks.
Contribution
The creation of TrialPanorama, a comprehensive dataset linking clinical trial records with biomedical ontologies and associated literature, and the development of a specialized 8B-parameter LLM that surpasses larger generic models on clinical research tasks.
Findings
The specialized 8B-parameter LLM outperforms 70B-parameter generic models across all tasks.
Generic LLMs have limited clinical reasoning capabilities.
TrialPanorama enables effective training and evaluation of clinical research AI models.
Abstract
Developing artificial intelligence (AI) for clinical research requires a comprehensive data foundation that supports model training and rigorous evaluation. Here, we introduce TrialPanorama, a large-scale structured resource that aggregates 1.6M clinical trial records from fifteen global registries and links them with biomedical ontologies and associated literature. To demonstrate its utility, we build a pipeline that constructs 152K training and testing samples for eight key clinical research tasks. Three tasks support systematic review workflows, including study search, study screening, and evidence summarization. Five tasks focus on trial design and optimization, including arm design, eligibility criteria design, endpoint selection, sample size estimation, and trial completion assessment and rationalization. Benchmarking cutting-edge large language models (LLMs) reveals that generic…
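The task grouping described in the abstract can be sketched as a small lookup table. This is a minimal illustration only: the task names come from the abstract, while the identifiers (`Workflow`, `TASKS`, `tasks_for`) are hypothetical and are not taken from the TrialPanorama code or data release.

```python
from enum import Enum


class Workflow(Enum):
    """The two task families named in the abstract."""
    SYSTEMATIC_REVIEW = "systematic review"
    TRIAL_DESIGN = "trial design and optimization"


# Task names follow the abstract; the snake_case keys are hypothetical labels.
TASKS = {
    "study_search": Workflow.SYSTEMATIC_REVIEW,
    "study_screening": Workflow.SYSTEMATIC_REVIEW,
    "evidence_summarization": Workflow.SYSTEMATIC_REVIEW,
    "arm_design": Workflow.TRIAL_DESIGN,
    "eligibility_criteria_design": Workflow.TRIAL_DESIGN,
    "endpoint_selection": Workflow.TRIAL_DESIGN,
    "sample_size_estimation": Workflow.TRIAL_DESIGN,
    "trial_completion_assessment": Workflow.TRIAL_DESIGN,
}


def tasks_for(workflow: Workflow) -> list[str]:
    """Return the task labels that belong to one workflow family."""
    return [name for name, group in TASKS.items() if group is workflow]


if __name__ == "__main__":
    for workflow in Workflow:
        print(f"{workflow.value}: {', '.join(tasks_for(workflow))}")
```

Keeping the family as an enum value rather than a free-text string makes it easy to report benchmark results per family as well as per task.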
Taxonomy
Topics: Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education · Genomics and Rare Diseases
