Agreement Between Large Language Models, Human Reviewers, and Authors in Evaluating STROBE Checklists for Observational Studies in Rheumatology

Emre Bilgin; Ebru Ozturk; Meera Shah; Lisa Traboco; Rebecca Everitt; Ai Lyn Tan; Marwan Bukhari; Vincenzo Venerito; Latika Gupta

arXiv:2603.19303 · cs.DL · March 23, 2026

Open Access

TL;DR

This study measures how closely large language models, human reviewers, and manuscript authors agree when assessing STROBE checklist compliance in observational rheumatology studies, highlighting both the potential and the limitations of LLMs for this task.

Contribution

The paper provides a comparative analysis of LLMs, human reviewers, and authors in assessing STROBE adherence, showing that LLMs are strong on simple checks and weaker on complex evaluations.

Findings

01. LLMs achieved complete agreement on formatting elements.

02. Agreement was high for the Presentation and Context domain.

03. Lower agreement was observed on complex methodological items.

Abstract

Introduction: Evaluating compliance with the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement can be time-consuming and subjective. This study compares STROBE assessments from large language models (LLMs), a human reviewer panel, and the original manuscript authors in observational rheumatology research.

Methods: Guided by the GRRAS and DEAL Pathway B frameworks, 17 rheumatology articles were independently assessed. Evaluations used the 22-item STROBE checklist, completed by the authors, a five-person human panel (ranging from junior to senior professionals), and two LLMs (ChatGPT-5.2, Gemini-3Pro). Items were grouped into Methodological Rigor and Presentation and Context domains. Inter-rater reliability was calculated using Gwet's Agreement Coefficient (AC1).

Results: Overall agreement across all reviewers was 85.0% (AC1=0.826). Domain…
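For readers unfamiliar with Gwet's AC1, the following is a minimal sketch of the standard multi-rater formula, in which chance agreement is estimated as the sum of pi_k * (1 - pi_k) over categories divided by (q - 1), where pi_k is the average share of raters choosing category k and q is the number of categories. The abstract does not say which software or exact variant the authors used, so this is an illustration only. The function name gwet_ac1, the "yes"/"no" labels, and the toy ratings are all hypothetical and are not data from the study.

```python
from collections import Counter


def gwet_ac1(ratings, categories):
    """Return (observed agreement, chance agreement, AC1).

    ratings: one list of labels per checklist item, one label per rater.
    categories: every label a rater may assign (at least two).
    Hypothetical helper for illustration; not the authors' code.
    """
    q = len(categories)
    if q < 2:
        raise ValueError("need at least two categories")
    n = len(ratings)
    pa_sum = 0.0
    pi = {c: 0.0 for c in categories}
    for item in ratings:
        r = len(item)
        if r < 2:
            raise ValueError("each item needs at least two ratings")
        counts = Counter(item)
        # Share of rater pairs that agree on this item.
        pa_sum += sum(counts[c] * (counts[c] - 1) for c in categories) / (r * (r - 1))
        # Share of raters who chose each category on this item.
        for c in categories:
            pi[c] += counts[c] / r
    pa = pa_sum / n
    pi = {c: total / n for c, total in pi.items()}
    # Gwet's chance-agreement term.
    pe = sum(p * (1 - p) for p in pi.values()) / (q - 1)
    return pa, pe, (pa - pe) / (1 - pe)


# Invented toy data: 4 items, 3 raters each (not from the paper).
toy_ratings = [
    ["yes", "yes", "yes"],
    ["yes", "yes", "no"],
    ["yes", "yes", "yes"],
    ["no", "no", "no"],
]
pa, pe, ac1 = gwet_ac1(toy_ratings, ["yes", "no"])
print(round(pa, 3), round(pe, 3), round(ac1, 3))  # 0.833 0.444 0.7
```

In the toy data, raw agreement is about 83% while AC1 is 0.70, which shows how the coefficient discounts agreement that could arise by chance. The study's reported figures (85.0% agreement, AC1=0.826) come from its own 17 articles and raters and cannot be reproduced from this sketch.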

Taxonomy

Topics: Reliability and Agreement in Measurement · Meta-analysis and systematic reviews · Delphi Technique in Research