FLAWS: A Benchmark for Error Identification and Localization in Scientific Papers

Sarina Xi; Vishisht Rao; Justin Payan; Nihar B. Shah

arXiv:2511.21843 · cs.CL · December 1, 2025

TL;DR

This paper introduces FLAWS, a benchmark for evaluating how effectively large language models can identify and localize errors in scientific papers, addressing a critical challenge in peer review amid growing scientific output.

Contribution

The paper presents FLAWS, a novel automated benchmark with 713 paper-error pairs, designed to systematically evaluate LLMs' ability to detect and localize errors in scientific research papers.

Findings

01 GPT 5 achieves 39.1% accuracy in error identification

02 The benchmark effectively challenges LLMs with well-defined, relevant errors

03 Systematic insertion of errors enables scalable evaluation of LLM capabilities

Abstract

The identification and localization of errors is a core task in peer review, yet the exponential growth of scientific output has made it increasingly difficult for human reviewers to reliably detect errors given the limited pool of experts. Recent advances in Large Language Models (LLMs) have sparked interest in their potential to support such evaluation tasks, from academic peer review to automated scientific assessment. However, despite the growing use of LLMs in review systems, their capabilities to pinpoint errors remain underexplored. In this work, we introduce Fault Localization Across Writing in Science (FLAWS), an automated benchmark consisting of 713 paper-error pairs designed to evaluate how effectively LLMs detect errors that undermine key claims in research papers. We construct the benchmark by systematically inserting claim-invalidating errors into peer-reviewed papers…

Taxonomy

Topics: Topic Modeling · Expert finding and Q&A systems · Academic integrity and plagiarism