Reviewing Scientific Papers for Critical Problems With Reasoning LLMs: Baseline Approaches and Automatic Evaluation

Tianmai M. Zhang; Neil F. Abernethy

arXiv:2505.23824·cs.CL·April 3, 2026

TL;DR

This paper explores using large language models as automated quality checkers that flag critical problems in scientific papers, rather than as generators of full reviews, aiming to improve the efficiency and reliability of peer review.

Contribution

The paper introduces several baseline approaches and an extendable automatic evaluation framework that uses reasoning LLMs as judges, validated on papers withdrawn from arXiv. It also offers insights into scientific reasoning and a public dataset.

Findings

01 The o3 model achieved the best problem detection performance.

02 The framework effectively identifies critical errors in scientific papers.

03 API costs for the models are analyzed for practical deployment.

Abstract

Recent advancements in large language models have sparked interest in utilizing them to aid the peer review process of scientific publication amid the peer review crisis. However, having AI models generate full reviews in the same way as human reviewers risks exacerbating the irresponsible use of LLM-generated reviews and instigating intentional manipulation. As an alternative, we propose adopting LLMs as manuscript quality checkers. We introduce several baseline approaches and an extendable automatic evaluation framework using top reasoning LLMs as judges to tackle the difficulty of recruiting domain experts for manual evaluation. Utilizing papers withdrawn from arXiv, we validated our proposed methods with several leading reasoning LLMs available in May-June 2025 and assessed their performance and API costs for identifying critical errors and unsoundness problems in scientific papers.…
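
The evaluation idea in the abstract (a checker reports problems, and a judge decides whether they match the known issue behind a withdrawal) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the `WithdrawnPaper` record, the `hit_rate` metric, and the `keyword_judge` and `toy_checker` stand-ins are all hypothetical names introduced for illustration. In practice the checker and the judge would each call a reasoning LLM instead of the toy functions used here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class WithdrawnPaper:
    """A paper with a known critical problem (hypothetical record layout)."""
    paper_id: str
    text: str
    known_problem: str  # e.g. taken from the withdrawal comment


# A checker maps paper text to a list of reported problems.
Checker = Callable[[str], List[str]]
# A judge decides whether a reported problem matches the known one.
Judge = Callable[[str, str], bool]


def keyword_judge(reported: str, known: str) -> bool:
    """Stand-in for an LLM judge: true if both texts share a word of 5+ letters."""
    long_words = {w for w in known.lower().split() if len(w) >= 5}
    return any(w in long_words for w in reported.lower().split())


def hit_rate(papers: List[WithdrawnPaper], checker: Checker, judge: Judge) -> float:
    """Fraction of papers where at least one reported problem matches the known one."""
    if not papers:
        return 0.0
    hits = 0
    for paper in papers:
        reports = checker(paper.text)
        if any(judge(r, paper.known_problem) for r in reports):
            hits += 1
    return hits / len(papers)


def toy_checker(text: str) -> List[str]:
    """Stand-in for an LLM checker: flags any paper that mentions a lemma."""
    return ["The proof of the main lemma is invalid"] if "lemma" in text else []


papers = [
    WithdrawnPaper("a", "We prove a lemma ...", "Error in the proof of Lemma 2"),
    WithdrawnPaper("b", "We measure ...", "Incorrect calibration of the detector"),
]
print(hit_rate(papers, toy_checker, keyword_judge))  # 0.5
```

Passing the checker and the judge in as plain callables keeps the scoring loop independent of any particular model or API, so different models can be swapped in without changing the metric.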
