# Automatic Reviewers Fail to Detect Faulty Reasoning in Research Papers: A New Counterfactual Evaluation Framework

**Authors:** Nils Dycke, Iryna Gurevych

arXiv: 2508.21422 · 2026-02-02

## TL;DR

This paper introduces a fully automated counterfactual evaluation framework that tests whether LLM-based automatic reviewers can detect faulty reasoning in research papers. The reviewers tested largely fail to do so: flaws in research logic have no significant effect on the reviews they produce.

## Contribution

The paper presents a fully automated counterfactual evaluation framework and an accompanying dataset for testing whether automatic reviewers detect faulty research logic, that is, inconsistencies between a paper's results, interpretations, and claims.

## Key findings

- Flaws in research logic have no significant effect on the reviews that automatic review generators (ARGs) produce
- Counterfactual evaluation exposes this limitation across the range of ARG approaches tested
- Framework and dataset are publicly released for future research

## Abstract

Large Language Models (LLMs) have great potential to accelerate and support scholarly peer review and are increasingly used as fully automatic review generators (ARGs). However, potential biases and systematic errors may pose significant risks to scientific integrity; understanding the specific capabilities and limitations of state-of-the-art ARGs is essential. We focus on a core reviewing skill that underpins high-quality peer review: detecting faulty research logic. This involves evaluating the internal consistency between a paper's results, interpretations, and claims. We present a fully automated counterfactual evaluation framework that isolates and tests this skill under controlled conditions. Testing a range of ARG approaches, we find that, contrary to expectation, flaws in research logic have no significant effect on their output reviews. Based on our findings, we derive three actionable recommendations for future work and release our counterfactual dataset and evaluation framework publicly.
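The counterfactual comparison described in the abstract can be illustrated with a small sketch. This summary does not state which review measures or statistical test the authors use, so everything below is an assumption made for illustration: each paper is taken to have one numeric review score for its original version and one for its counterfactual version with a logic flaw inserted, and a paired sign-flip permutation test checks whether the flaw shifts the score. The function name `paired_sign_flip_test` and the example scores are hypothetical.

```python
import random
from statistics import mean


def paired_sign_flip_test(original_scores, counterfactual_scores,
                          n_resamples=10_000, seed=0):
    """Paired sign-flip permutation test on review scores.

    Assumption (not from the paper): each index holds the score an ARG
    gave to the original paper and to its counterfactual, flawed twin.
    Returns (mean score difference, two-sided p-value).
    """
    if len(original_scores) != len(counterfactual_scores):
        raise ValueError("score lists must have the same length")
    if not original_scores:
        raise ValueError("score lists must not be empty")

    # Per-paper effect of the inserted flaw on the review score.
    diffs = [o - c for o, c in zip(original_scores, counterfactual_scores)]
    observed = mean(diffs)

    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the flaw has no effect, so the sign
        # of each paired difference is arbitrary and can be flipped.
        flipped = mean([d if rng.random() < 0.5 else -d for d in diffs])
        if abs(flipped) >= abs(observed) - 1e-12:
            extreme += 1

    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value


if __name__ == "__main__":
    # Hypothetical scores, invented purely for illustration.
    original = [6, 5, 7, 6, 5, 6, 7, 5]
    counterfactual = [6, 5, 6, 7, 5, 6, 7, 5]
    effect, p = paired_sign_flip_test(original, counterfactual)
    print(f"mean difference = {effect:.3f}, p = {p:.3f}")
```

A large p-value in such a test would correspond to the abstract's finding that inserted flaws have no significant effect on the output reviews; a reviewer that reliably detects the flaws should instead produce a clear score drop and a small p-value.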

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.21422/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/2508.21422/full.md

## References

72 references — full list in the complete paper: https://tomesphere.com/paper/2508.21422/full.md

---
Source: https://tomesphere.com/paper/2508.21422