CTBench: A Comprehensive Benchmark for Evaluating Language Model Capabilities in Clinical Trial Design
Nafis Neehal, Bowen Wang, Shayom Debopadhaya, Soham Dan, Keerthiram Murugesan, Vibha Anand, Kristin P. Bennett

TL;DR
CTBench is a new benchmark that evaluates language models' ability to identify baseline features in clinical trial design, with the aim of improving AI tools for more accurate and efficient clinical research planning.
Contribution
Introduces CTBench, a comprehensive benchmark with datasets and evaluation methods for assessing language models' performance in clinical trial baseline feature extraction.
Findings
GPT-4o shows promising evaluation capabilities validated by clinical experts.
Advanced prompt engineering improves baseline feature generation.
The benchmark highlights areas for future AI improvements in clinical trial design.
Abstract
CTBench is introduced as a benchmark to assess language models (LMs) in aiding clinical study design. Given study-specific metadata, CTBench evaluates AI models' ability to determine the baseline features of a clinical trial (CT), which include demographic and relevant features collected at the trial's start from all participants. These baseline features, typically presented in CT publications (often as Table 1), are crucial for characterizing study cohorts and validating results. Baseline features, including confounders and covariates, are also necessary for accurate treatment effect estimation in studies involving observational data. CTBench consists of two datasets: "CT-Repo," containing baseline features from 1,690 clinical trials sourced from clinicaltrials.gov, and "CT-Pub," a subset of 100 trials with more comprehensive baseline features gathered from relevant publications. Two…
Taxonomy
Topics: Biomedical Text Mining and Ontologies
Methods: WordPiece · Residual Connection · Weight Decay · Softmax · Layer Normalization · Attention Dropout · Linear Warmup With Linear Decay · Dropout · Adam
