CTBench: A Comprehensive Benchmark for Evaluating Language Model Capabilities in Clinical Trial Design

Nafis Neehal; Bowen Wang; Shayom Debopadhaya; Soham Dan; Keerthiram Murugesan; Vibha Anand; Kristin P. Bennett

arXiv:2406.17888·cs.CL·June 27, 2024·3 cites

TL;DR

CTBench is a benchmark for evaluating how well language models identify the baseline features of a clinical trial, with the aim of improving AI tools for more accurate and efficient clinical research planning.

Contribution

Introduces CTBench, a comprehensive benchmark with datasets and evaluation methods for assessing language models' performance in clinical trial baseline feature extraction.

Findings

1. GPT-4o shows promising evaluation capabilities validated by clinical experts.

2. Advanced prompt engineering improves baseline feature generation.

3. Benchmark highlights areas for future AI improvements in clinical trial design.

Abstract

CTBench is introduced as a benchmark to assess language models (LMs) in aiding clinical study design. Given study-specific metadata, CTBench evaluates AI models' ability to determine the baseline features of a clinical trial (CT), which include demographic and relevant features collected at the trial's start from all participants. These baseline features, typically presented in CT publications (often as Table 1), are crucial for characterizing study cohorts and validating results. Baseline features, including confounders and covariates, are also necessary for accurate treatment effect estimation in studies involving observational data. CTBench consists of two datasets: "CT-Repo," containing baseline features from 1,690 clinical trials sourced from clinicaltrials.gov, and "CT-Pub," a subset of 100 trials with more comprehensive baseline features gathered from relevant publications. Two…
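As a rough illustration of the task described above, the sketch below compares a model's predicted baseline features for one trial against a reference list and reports set-overlap precision, recall, and F1. The function names, the example feature lists, and the use of normalized exact string matching are all assumptions made for illustration; this is not CTBench's own evaluation procedure.

```python
from typing import Dict, Iterable, List, Union


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def score_features(
    predicted: Iterable[str], reference: Iterable[str]
) -> Dict[str, Union[float, List[str]]]:
    """Set-overlap scores between predicted and reference baseline features.

    Assumption: two features match only if their normalized names are
    identical. Synonyms (e.g. "sex" vs "gender") would NOT match here.
    """
    pred = {normalize(p) for p in predicted}
    ref = {normalize(r) for r in reference}
    matched = pred & ref

    precision = len(matched) / len(pred) if pred else 0.0
    recall = len(matched) / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "matched": sorted(matched),
    }


if __name__ == "__main__":
    # Hypothetical feature lists, invented for illustration only.
    model_output = ["Age", "Sex", "BMI"]
    reference_features = ["age", "sex", "Smoking status", "HbA1c"]
    print(score_features(model_output, reference_features))
```

Exact matching is the simplest possible choice and will undercount semantically equivalent features, which is one reason a benchmark of this kind might prefer a model-based or expert-validated matcher.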

Code & Models

Repositories

nafis-neehal/CTBench_LLM
Official

Taxonomy

Topics: Biomedical Text Mining and Ontologies

Methods: WordPiece · Residual Connection · Weight Decay · Softmax · Layer Normalization · Attention Dropout · Linear Warmup With Linear Decay · Dropout · Adam