FLAWS: A Benchmark for Error Identification and Localization in Scientific Papers
Sarina Xi, Vishisht Rao, Justin Payan, Nihar B. Shah

TL;DR
This paper introduces FLAWS, a benchmark for evaluating how effectively large language models can identify and localize errors in scientific papers, addressing a critical challenge in peer review amid growing scientific output.
Contribution
The paper presents FLAWS, a novel automated benchmark with 713 paper-error pairs, designed to systematically evaluate LLMs' ability to detect and localize errors in scientific research papers.
Findings
GPT 5 achieves 39.1% accuracy in error identification
The benchmark effectively challenges LLMs with well-defined, relevant errors
Systematic insertion of errors enables scalable evaluation of LLM capabilities
Abstract
The identification and localization of errors is a core task in peer review, yet the exponential growth of scientific output has made it increasingly difficult for human reviewers to reliably detect errors given the limited pool of experts. Recent advances in Large Language Models (LLMs) have sparked interest in their potential to support such evaluation tasks, from academic peer review to automated scientific assessment. However, despite the growing use of LLMs in review systems, their capabilities to pinpoint errors remain underexplored. In this work, we introduce Fault Localization Across Writing in Science (FLAWS), an automated benchmark consisting of 713 paper-error pairs designed to evaluate how effectively LLMs detect errors that undermine key claims in research papers. We construct the benchmark by systematically inserting claim-invalidating errors into peer-reviewed papers…
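As a rough illustration of what an identification-accuracy figure over paper-error pairs could mean, here is a minimal sketch. It is not the benchmark's actual scoring code: the `PaperErrorPair` fields, the cutoff `k`, and the substring-based matching rule are all assumptions made for this example, since the summary above does not specify how a prediction is matched to an inserted error.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PaperErrorPair:
    """One benchmark item (hypothetical schema for illustration)."""
    paper_id: str
    error_excerpt: str  # text of the inserted error


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching ignores formatting."""
    return " ".join(text.lower().split())


def is_identified(pair: PaperErrorPair, predictions: List[str], k: int) -> bool:
    """Return True if any of the first k predicted excerpts overlaps the error.

    Assumption: a prediction counts as a match when it contains, or is
    contained in, the inserted error text after normalization.
    """
    target = normalize(pair.error_excerpt)
    for predicted in predictions[:k]:
        candidate = normalize(predicted)
        if candidate and (candidate in target or target in candidate):
            return True
    return False


def identification_accuracy(
    pairs: List[PaperErrorPair],
    predictions_by_paper: Dict[str, List[str]],
    k: int = 10,
) -> float:
    """Fraction of paper-error pairs whose error appears in the top-k predictions."""
    if not pairs:
        return 0.0
    hits = sum(
        is_identified(pair, predictions_by_paper.get(pair.paper_id, []), k)
        for pair in pairs
    )
    return hits / len(pairs)
```

Under this sketch, a model that returns a ranked list of suspect excerpts per paper is scored by whether the inserted error falls within its first `k` guesses; a stricter or looser matching rule would change the resulting number.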
Taxonomy
Topics: Topic Modeling · Expert finding and Q&A systems · Academic integrity and plagiarism
