Evaluating the Ability of Large Language Models to Identify Adherence to CONSORT Reporting Guidelines in Randomized Controlled Trials: A Methodological Evaluation Study

Zhichao He; Mouxiao Bian; Jianhong Zhu; Jiayuan Chen; Yunqiu Wang; Wenxia Zhao; Tianbin Li; Bing Han; Jie Xu; Junyan Wu

arXiv:2511.13107·cs.CL·November 18, 2025

Open Access

TL;DR

This study evaluates the effectiveness of large language models in automatically assessing adherence to CONSORT guidelines in RCTs, revealing modest accuracy and highlighting current limitations for reliable use in peer review.

Contribution

The study provides a systematic evaluation of LLMs' performance in identifying CONSORT adherence, demonstrating both their potential and their current shortcomings in this task.

Findings

01. Top models achieved macro F1 scores around 0.63.

02. Models excel at identifying compliant items but struggle with non-compliance.

03. Current LLMs are not reliable enough to replace human expert assessment.

Abstract

The Consolidated Standards of Reporting Trials (CONSORT) statement is the global benchmark for transparent and high-quality reporting of randomized controlled trials (RCTs). Manual verification of CONSORT adherence is a laborious, time-intensive process that constitutes a significant bottleneck in peer review and evidence synthesis. This study aimed to systematically evaluate the accuracy and reliability of contemporary large language models (LLMs) in identifying the adherence of published RCTs to the CONSORT 2010 statement under a zero-shot setting. We constructed a gold-standard dataset of 150 published RCTs spanning diverse medical specialties. The primary outcome was the macro-averaged F1-score for the three-class classification task, supplemented by item-wise performance metrics and qualitative error analysis. Overall model performance was modest. The top-performing models, Gemini-2.5-Flash and DeepSeek-R1, achieved…
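The primary outcome named above, the macro-averaged F1-score, can be illustrated with a minimal sketch. It is not the authors' evaluation code. The three class names (`compliant`, `non_compliant`, `not_applicable`) and the toy labels are assumptions made for illustration only; the abstract does not name the classes.

```python
from collections import Counter

# Hypothetical class names; the abstract only says the task has three classes.
LABELS = ("compliant", "non_compliant", "not_applicable")


def per_class_f1(y_true, y_pred, labels=LABELS):
    """Return a dict mapping each label to its one-vs-rest F1 score."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the class never occurs.
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-class F1, so every class counts equally."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


if __name__ == "__main__":
    # Toy annotations for six CONSORT items (invented, not from the study).
    gold = ["compliant"] * 3 + ["non_compliant"] * 2 + ["not_applicable"]
    pred = ["compliant"] * 4 + ["non_compliant"] + ["not_applicable"]
    print(per_class_f1(gold, pred))
    print(round(macro_f1(gold, pred), 4))
```

Because the macro average weights each class equally regardless of how often it occurs, weak performance on a less frequent class such as non-compliance lowers the overall score even when the compliant class is identified well.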

Taxonomy

Topics: Meta-analysis and systematic reviews · Ethics in Clinical Research · Academic integrity and plagiarism