TL;DR
This paper explores using large language models as automated tools to evaluate scientific papers for critical issues, aiming to improve peer review efficiency and reliability.
Contribution
It introduces baseline methods and an extendable automatic evaluation framework that uses reasoning LLMs as judges, validated on papers withdrawn from arXiv, with insights into scientific reasoning and a public dataset.
Findings
The o3 model achieved the best problem detection performance among the evaluated models.
The framework effectively identifies critical errors in scientific papers.
API costs for the evaluated models are analyzed to assess practical deployment.
Abstract
Recent advancements in large language models have sparked interest in utilizing them to aid the peer review process of scientific publication amid the peer review crisis. However, having AI models generate full reviews in the same way as human reviewers risks exacerbating the irresponsible use of LLM-generated reviews and instigating intentional manipulation. As an alternative, we propose adopting LLMs as manuscript quality checkers. We introduce several baseline approaches and an extendable automatic evaluation framework using top reasoning LLMs as judges to tackle the difficulty of recruiting domain experts for manual evaluation. Utilizing papers withdrawn from arXiv, we validated our proposed methods with several leading reasoning LLMs available in May-June 2025 and assessed their performance and API costs for identifying critical errors and unsoundness problems in scientific papers.…
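The abstract describes scoring each checker by whether an LLM judge finds that it identified a paper's known problem, and by what the API calls cost. The following is a minimal sketch of that bookkeeping only, not the authors' implementation. The `CheckResult` record, the `summarize` function, the `judged_hit` field, and the per-million-token prices are all hypothetical names and values introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CheckResult:
    """One checker run on one paper (hypothetical record layout)."""
    paper_id: str
    judged_hit: bool     # assumed judge verdict: checker output matched the known problem
    input_tokens: int    # tokens sent to the checker model
    output_tokens: int   # tokens returned by the checker model


def summarize(results: List[CheckResult],
              usd_per_m_input: float,
              usd_per_m_output: float) -> Dict[str, float]:
    """Aggregate hit rate and API cost over a set of checker runs.

    Prices are expressed in USD per one million tokens; the caller
    supplies them, since they are assumptions here.
    """
    if not results:
        raise ValueError("no results to summarize")

    n = len(results)
    hits = sum(1 for r in results if r.judged_hit)

    # Cost in USD: tokens times price per million tokens, divided by one million.
    total_cost = sum(
        r.input_tokens * usd_per_m_input + r.output_tokens * usd_per_m_output
        for r in results
    ) / 1_000_000

    return {
        "hit_rate": hits / n,
        "total_cost_usd": total_cost,
        "cost_per_paper_usd": total_cost / n,
    }
```

Reporting cost per paper alongside hit rate makes models comparable when one is more accurate but considerably more expensive to run.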
