Automatic Reviewers Fail to Detect Faulty Reasoning in Research Papers: A New Counterfactual Evaluation Framework
Nils Dycke, Iryna Gurevych

TL;DR
This paper introduces a new automated counterfactual evaluation framework to assess whether AI reviewers can detect faulty reasoning in research papers, revealing current limitations in their ability to identify logical flaws.
Contribution
The paper presents a novel, fully automated evaluation framework and dataset for testing AI's ability to detect faulty research logic, highlighting current shortcomings.
Findings
AI reviewers do not significantly detect flawed logic
Counterfactual evaluation reveals limitations of current ARGs
Framework and dataset are publicly released for future research
| Target | Original | Compromised | Paper Edit |
|---|---|---|---|
| Finding (example on Lin et al. (2024)) | Spoken-LLM outperforms text-only baselines and prior speech LLM methods […]. | Spoken-LLM outperforms all existing models […]. | "With the same backbone model, the proposed method outperforms all existing models […]." |
| Conclusion (example on Chen et al. (2024)) | The MFT method achieved a 5% increase in accuracy on the GSM8K dataset. | The MFT method achieved a 5% increase in accuracy on the GSM8K dataset, with an even greater improvement of 7% observed […]. | "With just this minor modification, a 5% increase in accuracy can be achieved […] and an even greater improvement of 7% […]." |
| Result (example on Rao et al. (2023)) | The consistency scores […] were quantified, revealing that ChatGPT had a score of 0.907 […]. | Our findings indicate that while ChatGPT’s consistency score was slightly lower at 0.807 compared […] | Table 2: 0.907 0.807 |
| Paper Distribution | |
|---|---|
| #papers | |
| #papers p. conference | |
| #papers p. institution | |
| Research Logic Distribution | |
| #paper types | |
| #papers p. paper type | |
| #findings p. paper | |
| | CF_CR | CF_NE | total |
|---|---|---|---|
| #CFs | | | |
| #CFs / paper | | | |
| #edits / CF | | | |
| diff. / CF | | | |
| ARG | z |
|---|---|
| oracle | |
| Reviewer2 | |
| Zero-Generic-GPT4.1 | |
| Zero-Generic-Phi4 | |
| Zero-Generic-GPT4om | |
| Zero-Generic-DeepSeek14B | |
| Zero-Guide-GPT4om | |
| Zero-Guide-DeepSeek14B | |
| Zero-Guide-DeepSeekV3 | |
| DeepReviewer | |
| TreeReviewer | |
| Zero-Guide-Phi4 | |
| Zero-Generic-DeepSeekV3 |
| ARG | Aspects p-value | Aspects corrected | Sentiment p-value | Sentiment corrected | Score p-value | Score corrected |
|---|---|---|---|---|---|---|
| oracle | ||||||
| Reviewer2 | ||||||
| DeepReviewer | ||||||
| Zero-Guide-DeepSeek14B | ||||||
| Zero-Generic-DeepSeek14B | ||||||
| Zero-Guide-DeepSeekV3 | ||||||
| Zero-Generic-DeepSeekV3 | ||||||
| Zero-Guide-Phi4 | ||||||
| Zero-Generic-Phi4 | ||||||
| Zero-Guide-GPT4om | ||||||
| Zero-Generic-GPT4om | ||||||
| Zero-Generic-GPT4.1 | ||||||
| TreeReviewer | ||||||
| ARG | ROUGE-2 | Assertion Jaccard |
|---|---|---|
| oracle | ||
| Reviewer2 | ||
| DeepReviewer | ||
| Zero-Guide-DeepSeek14B | ||
| Zero-Generic-DeepSeek14B | ||
| Zero-Guide-DeepSeekV3 | ||
| Zero-Generic-DeepSeekV3 | ||
| Zero-Guide-GPT4om | ||
| Zero-Generic-GPT4om | ||
| Zero-Guide-Phi4 | ||
| Zero-Generic-Phi4 |
Automatic Reviewers Fail to Detect Faulty Reasoning in Research Papers: A New Counterfactual Evaluation Framework
Nils Dycke
Iryna Gurevych
UKP Lab, Department of Computer Science and
National Research Center for Applied Cybersecurity ATHENE
Technical University of Darmstadt
Abstract
Large Language Models (LLMs) have great potential to accelerate and support scholarly peer review and are increasingly used as fully automatic review generators (ARGs). However, potential biases and systematic errors may pose significant risks to scientific integrity; understanding the specific capabilities and limitations of state-of-the-art ARGs is essential. We focus on a core reviewing skill that underpins high-quality peer review: detecting faulty research logic. This involves evaluating the internal consistency between a paper’s results, interpretations, and claims. We present a fully automated counterfactual evaluation framework that isolates and tests this skill under controlled conditions. Testing a range of ARG approaches, we find that, contrary to expectation, flaws in research logic have no significant effect on their output reviews. Based on our findings, we derive three actionable recommendations for future work and release our counterfactual dataset and evaluation framework publicly at https://github.com/UKPLab/tacl2026-counter-review-logic.
1 Introduction
Scholarly peer review is the cornerstone of academic quality control Birukou et al. (2011); Bornmann (2011). In this process, experts evaluate a submitted paper, writing review reports that assess its soundness, clarity, and novelty Jefferson et al. (2002). However, peer review is time-consuming and requires expertise Waltman et al. (2023). The growing volume of submissions, especially in AI research Sculley et al. (2018), and the shortage of qualified reviewers McCook (2006) exacerbate the burden on reviewers. The rise of Large Language Models (LLMs) like ChatGPT OpenAI et al. (2024) has led to an upsurge in LLM-assisted, and sometimes fully generated, review reports Liang et al. (2024a), their official integration into peer review systems Association for the Advancement of Artificial Intelligence (2025), and growing research interest in automatic review generators (ARGs) Yu et al. (2024); D’Arcy et al. (2024); Yuan et al. (2022), machine-based scientific discovery Weng et al. (2025); Lou et al. (2025), and peer review assistance Chamoun et al. (2024); Dycke et al. (2023a). Yet, LLMs are prone to factual errors Ji et al. (2023) and biases Gallegos et al. (2024), making their unsupervised use as ARGs a potential threat to scientific integrity. It is imperative to empirically assess the capabilities of current ARGs to inform policymakers, guide responsible integration of LLMs into peer review, and clarify which aspects of reviewing require human judgment.
Current research on ARG performance yields highly inconsistent results, with some studies reporting that ARGs show human-level or superior performance Tyser et al. (2024); Liang et al. (2024b); Idahl and Ahmadi (2025); Kirtani et al. (2025) and commendable paper limitation identification abilities Zhang and Abernethy (2025); Liu and Shah (2023), while others find that automatic reviews are generic Du et al. (2024) or fail to spot obvious errors Son et al. (2025); Li et al. (2025). This variance stems from two key factors. First, definitions of review quality differ widely. Second, peer review involves a diverse set of complex subtasks, such as recalling literature, reading comprehension, and step-wise reasoning Dycke et al. (2025). Many of these depend on the paper’s publication context, such as related work, research trends, and evaluation standards; for instance, the assessment of state-of-the-art results is highly context-dependent since it is linked to the state of the art at the time of reviewing. Existing evaluations conflate these skills and frequently disregard the original reviewing context Kuznetsov et al. (2024). While measuring overall reviewing performance is important, these factors confound existing evaluations, inducing high variance across studies and precluding definitive conclusions about current ARG capabilities.
In this paper, to address these challenges, we evaluate ARGs in a controlled experiment considering a skill that is prerequisite to reliable peer review and, as we show in Section 3, depends on the paper content regardless of the reviewing context. Specifically, we test ARGs’ ability to identify faulty research logic during reviewing. A paper’s research logic encompasses its experimental design, the reasoning from measurements to interpretations, and derived findings. Assessing the soundness of a paper requires careful scrutiny of these elements and their logical relations.
We first propose a new model of paper soundness, formalizing it as a research logic graph. Based on this, we introduce a new counterfactual evaluation framework Molnar (2025); Wu et al. (2021) that extracts the research logic from sound papers, introduces targeted misalignments through surgical edits, and compares reviews of original versus counterfactual versions to determine whether faulty research logic significantly impacts ARGs’ reviewing behavior. Unlike prior error sensitivity analyses Liu and Shah (2023); Zhang and Abernethy (2025); Li et al. (2025); Son et al. (2025), our analysis eliminates confounding effects by using counterfactuals that intervene solely on the soundness of the research logic, and does not depend on an absolute notion of review quality because it focuses on relative differences between reviews. We develop and validate an LLM-based pipeline to produce counterfactual versions of research papers with intentionally flawed research logic. The resulting dataset supports multiple downstream applications, including evaluation, explanation, and training data augmentation Wu et al. (2021). Our fully automatic approach is independent of human review data, allowing us to test ARGs on new and unseen papers at scale without risks of contamination Sainz et al. (2023). Our framework is general and applicable to any ARG regardless of its architecture.
Our experiments show that faulty research logic has no significant effect on the generated reviews of state-of-the-art ARGs, raising serious concerns about their practical use during peer review. We analyze their automatic reviewing behavior and derive actionable recommendations to move the field forward, including task design, human-LLM collaboration, and improved evaluation practices. Overall, we contribute:
- A formal model of paper soundness as a product of its underlying research design and reasoning, i.e., its research logic.
- The first fully automatic counterfactual evaluation framework for ARGs focused on the detection of flawed research logic.
- A novel dataset of counterfactual research papers based on recent AI and NLP publications from major conferences including ACL, EMNLP, NeurIPS, and ICLR.
- Insights into the capabilities and limitations of state-of-the-art ARGs in identifying flawed research logic during review.
2 Related Work
Counterfactual Evaluation
Counterfactuals are key to studying causal relationships Pearl (2000). By asking what a model would predict under the same conditions if the input were changed, counterfactuals have become a standard approach in NLP for explaining Molnar (2025); Wang et al. (2024a) and evaluating models Wu et al. (2021). Prior work is largely limited to short texts Wang et al. (2024a) or fixed narrative structures Mu and Li (2024); Wang et al. (2024b). In contrast, we propose a pipeline to construct counterfactuals that intervene on full research papers. Further, while most approaches are designed for classification problems Molnar (2025), we develop a framework tailored to full review reports as an output.
ARG Error Sensitivity Analysis
Recent works on ARG evaluation (Shin et al., 2025; Du et al., 2024; Xu et al., 2025, i.a.) heavily rely on often noisy human review data as ground truth. To reduce this dependence, several works perform sensitivity analysis by introducing errors into papers. Liu and Shah (2023) manually inject errors into short scientific excerpts and inspect whether automatic reviews mention them. While their study is limited to scientific essays, we consider nearly full papers. Zhang and Abernethy (2025) and Son et al. (2025) use retracted papers, leveraging self-reported errors to evaluate ARGs via LLM-based judges. These approaches depend on public retraction data and often involve issues, like plagiarized figures, that require external knowledge. In contrast, we focus on verifying research logic, which is self-contained within the paper. Li et al. (2025) automatically perturb paper sections (e.g. omitting implementation details) but unlike our study they exclusively consider automatic review scores without reports. Tyser et al. (2024) propose a semi-automatic method to introduce broad error types (e.g. omitting related work) and check if they are detected. Our approach differs by introducing precise, targeted edits that affect only the paper’s soundness. Finally, Dycke et al. (2025) use counterfactuals to investigate reasoning during peer review but focus on humans instead of ARGs.
In summary, this work is the first to isolate and study ARGs’ reasoning abilities during review generation in a fully automatic evaluation pipeline.
3 Research Logic
Soundness, i.e. the correctness of a paper’s underlying scientific process, is a central criterion in peer review Jefferson et al. (2002). To evaluate whether ARGs can identify flawed research design and reasoning, we first define soundness formally. We focus on AI and NLP domains as they are most common in prior work Kuznetsov et al. (2024), allowing direct comparison; unless stated otherwise, we refer to papers (we use the terms paper and manuscript interchangeably) from only these fields.
Soundness
As in any empirical science, a sound NLP or AI paper makes a contribution to collective knowledge and justifies it through valid reasoning based on empirical evidence Armstrong and Green (2022). By this definition, a paper’s soundness comprises four building blocks, which we refer to throughout the rest of the paper: the method describing the experimental setup, the empirical results generated by it, the conclusions drawn from these results, and the finding summarizing the conclusions as a scientific contribution. Well-designed papers explicitly present all four components. While not all papers present empirical findings, e.g., those with theoretical proofs or opinions, our analysis in Section 5 shows that a vast majority of NLP and AI research papers make at least some empirical contribution. Based on this, we define our model of soundness formally. Our model effectively operationalizes Bacon’s (1878) theory of the scientific method as inductive reasoning from evidence.
Formal Model
A paper $p$ claims to contribute findings $F$. The paper is sound only if all findings $f \in F$ are sound. A finding $f$ is sound iff all underlying conclusions $C_f$ are sound and jointly support $f$. A conclusion $c$ is sound iff its underlying results $R_c$ are sound and jointly support $c$. A result $r$ is sound iff the experimental methodology adheres to best scientific practice and plausibly produces the result. We assume that $C_f$ and $R_c$ are the minimal sufficient sets of conclusions and results, respectively; i.e., all elements are necessary to support the corresponding finding $f$ or conclusion $c$. These building blocks form a hierarchical, logical structure, which we call the paper’s research logic, in which arcs represent the support relation. Figure 1 illustrates this structure. Importantly, the research logic differs from the paper’s broader argumentative structure, which consists of rhetorical moves to make the paper appear interesting and plausible Teufel et al. (2009). In Aristotelian terms, the research logic refers to the logical appeal (logos) while the broader argumentation in the paper often also invokes ethos (credibility) and pathos (appeal to emotion). In other words, soundness and convincingness are independent properties of a paper.
Soundness Verification
An unsound paper violates one or more of the support relations defined above. A common case is over-claiming, where findings misalign with the actual results; i.e., the conclusions do not jointly support the finding, and additional conclusions would usually be needed. Verifying a paper’s soundness thus involves checking each building block and the corresponding support relations within the research logic hierarchy. This process relies on background knowledge only when assessing the methodology; evaluating the soundness of conclusions and findings depends primarily on information within the paper. As such, verifying the support relationships between results, conclusions, and findings is a self-contained, context-independent task that tests reasoning and reading comprehension skills central to peer review and is well-suited for evaluating ARGs.
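The hierarchy described above can be illustrated with a small data structure. The following is a minimal sketch, not the authors' implementation: the class name `Node`, the field `supported`, and both functions are hypothetical, and the boolean flag simply stands in for the judgment of whether a block is jointly supported by the blocks below it.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One building block of the research logic (result, conclusion, or finding)."""
    text: str
    # Blocks one level down that must jointly support this block.
    supports: list["Node"] = field(default_factory=list)
    # Hypothetical flag: True if the blocks in `supports` jointly support this
    # block. For a result, it stands in for "the methodology plausibly
    # produces the result".
    supported: bool = True


def is_sound(node: Node) -> bool:
    """A block is sound iff it is supported and all supporting blocks are sound."""
    return node.supported and all(is_sound(child) for child in node.supports)


def paper_is_sound(findings: list[Node]) -> bool:
    """A paper is sound only if all of its findings are sound."""
    return all(is_sound(finding) for finding in findings)


# Usage: an over-claimed finding breaks the support relation at the top level.
result = Node("Accuracy improves by 5% on one dataset.")
conclusion = Node("The method improves accuracy on this dataset.", [result])
finding = Node("The method outperforms all existing models.", [conclusion],
               supported=False)
print(paper_is_sound([finding]))
```

Because the check is recursive, a single violated support relation anywhere in the hierarchy makes every finding that depends on it unsound.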
4 Framework
We use ARG as an umbrella term for systems that receive a research paper as input and output a review report. While ARGs usually involve LLMs, we do not probe LLMs per se but focus on their behavior in the context of automatic reviewing. A reliable ARG should detect flaws in research logic and reflect them in its output. Unlike prior work (Zhang and Abernethy, 2025, i.a.), which checks for mentions of specific issues, we focus on the average effect of flawed logic on automated reviews, recognizing that peer reviews weigh multiple factors and cannot highlight every concern. For instance, limited novelty might overshadow methodological flaws. Formally, we estimate the concept average treatment effect (ATE) Goyal et al. (2019) of research logic interventions on automatically generated reviews using approximate counterfactuals Gat et al. (2024). In other words, we surgically edit high-quality papers to compromise their research logic and quantify the resulting changes in automatic reviews, which should, on average, become more critical and emphasize soundness more compared to the automatic review of the original paper.
4.1 Pipeline
Our framework consists of three stages (Figure 2). First, given a paper $p$, we generate two sets of counterfactuals (CFs): soundness-critical $CF_{CR}$, which introduce errors into the paper’s research logic, altering its soundness while preserving other concepts such as clarity and novelty; and soundness-neutral $CF_{NE}$, which apply surface-level edits (e.g., formatting and language changes) to later contextualize the effects of the soundness-critical edits.
In the second stage, we run the ARG to generate a review $y_p$ per original paper, $y_{CR}$ for the soundness-critical counterfactuals, and $y_{NE}$ for the neutral ones. In the final stage, we extract numerical features from each review and compute the differences $\Delta_{CR}$ and $\Delta_{NE}$ between the original review and those from $CF_{CR}$ and $CF_{NE}$, respectively. Aggregating over the dataset $D$, we estimate the average treatment effect of soundness-critical edits $ATE_{CR}$ and of soundness-neutral edits $ATE_{NE}$, and compare them for each ARG.
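The aggregation in the final stage can be sketched as a mean over per-review differences. This is an illustrative sketch under simplifying assumptions: the function name `average_effect` and the dictionary layout are hypothetical, each review is reduced to a single numeric feature, and the effect is estimated as a plain mean of counterfactual-minus-original differences.

```python
from statistics import mean


def average_effect(original: dict[str, float],
                   counterfactual: dict[str, list[float]]) -> float:
    """Mean difference between counterfactual and original review features.

    `original` maps a paper ID to the feature of the original paper's review;
    `counterfactual` maps the same ID to the features of the reviews of that
    paper's counterfactual versions (hypothetical layout).
    """
    diffs = [
        cf_value - original[paper_id]
        for paper_id, cf_values in counterfactual.items()
        for cf_value in cf_values
    ]
    return mean(diffs)


# Usage with made-up review scores for two papers.
scores_original = {"p1": 6.0, "p2": 5.0}
scores_critical = {"p1": [5.0, 6.0], "p2": [4.0]}
scores_neutral = {"p1": [6.0], "p2": [5.0]}

effect_critical = average_effect(scores_original, scores_critical)
effect_neutral = average_effect(scores_original, scores_neutral)
print(effect_critical, effect_neutral)
```

Comparing the two estimates is what contextualizes the soundness-critical effect: a reliable ARG should show a clearly larger shift for soundness-critical edits than for surface-level ones.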
4.2 Counterfactual Paper Generation
Edits for counterfactual generation must satisfy four desiderata Molnar (2025); Gat et al. (2024). Edits must be relevant (A); i.e., they should directly impact the paper’s soundness negatively. They must be minimal (B); i.e., modifying only elements tied to soundness while preserving all other concepts. Edits must also be plausible (C); i.e., maintaining topical focus, fluency, and coherence. Finally, counterfactuals should be diverse (D), encompassing varied modifications to the research logic. While LLMs perform well in generating counterfactuals for sentences Li et al. (2024), they are not readily applicable to full research papers. To address this, we propose a two-step approach (Figure 2, Step I). First, we automatically extract the paper’s research logic. Then, LLMs disrupt the support relationship between findings, results, and conclusions (A). These changes are then projected on the paper text to ensure minimal edits (B).
Research Logic Extraction
We use zero-shot prompting with one self-refinement step Madaan et al. (2023). For each building block, we prompt the LLM to extract supporting spans with paragraph IDs and generate a summary. We apply this paradigm with building-block-specific adjustments for all steps (see Appendix A.2 for prompts and pseudo-code). The LLM first summarizes the paper’s research goal. It then extracts all contribution claims, filters for empirical findings, and ranks them by relevance to the goal. This ranking may not fully match human judgment but helps prioritize elements to modify during counterfactual generation. Next, it extracts all conclusions and links them to the supported findings. It identifies all results covered in figures, tables, and textual mentions and connects them to the conclusions. The output consists of the finding claims $F$, conclusions $C$, and results $R$, with their spans, coreferences, and joint support relations.
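The prompting paradigm above, a zero-shot call followed by one self-refinement cycle, can be sketched generically. The prompts below are placeholders and not the authors' prompts (those are in their appendix), and `llm` is a hypothetical callable that maps a prompt string to a response string; no specific API is assumed.

```python
from typing import Callable


def extract_with_refinement(llm: Callable[[str], str], paper: str,
                            block: str, cycles: int = 1) -> str:
    """Zero-shot extraction followed by `cycles` self-refinement steps."""
    # Initial zero-shot extraction (placeholder prompt wording).
    draft = llm(
        f"Extract every {block} from the paper below. For each one, cite the "
        f"supporting spans with their paragraph IDs and give a summary.\n\n{paper}"
    )
    # Self-refinement: the model critiques and revises its own draft.
    for _ in range(cycles):
        draft = llm(
            f"Here is a draft extraction of each {block}:\n{draft}\n\n"
            f"Check it against the paper, fix errors, and return the revised "
            f"extraction.\n\n{paper}"
        )
    return draft
```

With `cycles=1` this makes two model calls per building block, which matches the single refinement cycle described in the text.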
Soundness-critical Edits
After prompting the LLM to revise a selected building block to compromise its support relations in the research logic, the LLM edits the paper reflecting the new unsound logic. We apply three types of edits, on different levels of the research logic hierarchy for diversity (D). Table 1 provides examples. For finding edits, we misalign a finding with its underlying conclusions drawing on prior work on misrepresented science Wuehrl et al. (2024). The LLM classifies the finding as a (i) correlational, (ii) causal, or (iii) conditional claim. If none apply, no counterfactual is generated. For correlational claims, the LLM rephrases them as causal claims; for causal claims, it inverts the causal direction; for conditional claims, it removes the stated conditions. For conclusion edits, we misalign a conclusion with the underlying results. The LLM first generates a hypothetical result consistent with the paper’s scope but unsupported by the experiments. It then augments the conclusion and derived finding to include this result, leaving part of it unsubstantiated. For result edits, we misalign a result with a derived conclusion. The LLM negates or weakens the result leading to unsupported conclusions. For example, it may reduce a strong performance gain to a marginal improvement.
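The finding-edit rules above amount to a small dispatch on the claim type. The sketch below only encodes that mapping; the enum, the instruction strings, and the function name are hypothetical, and in the described pipeline both the classification and the rewriting are performed by an LLM rather than by fixed code.

```python
from enum import Enum
from typing import Optional


class ClaimType(Enum):
    CORRELATIONAL = "correlational"
    CAUSAL = "causal"
    CONDITIONAL = "conditional"
    OTHER = "other"  # none of the three types applies


# Edit rule per claim type, paraphrased from the description above.
FINDING_EDIT_RULES = {
    ClaimType.CORRELATIONAL: "Rephrase the correlational claim as a causal claim.",
    ClaimType.CAUSAL: "Invert the causal direction of the claim.",
    ClaimType.CONDITIONAL: "Remove the stated conditions from the claim.",
}


def finding_edit_instruction(claim_type: ClaimType) -> Optional[str]:
    """Return the edit instruction, or None if no counterfactual is generated."""
    return FINDING_EDIT_RULES.get(claim_type)


# Usage: a claim that fits none of the types yields no finding counterfactual.
print(finding_edit_instruction(ClaimType.CONDITIONAL))
print(finding_edit_instruction(ClaimType.OTHER))
```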
We design dedicated prompts per edit type using the LLM in zero-shot mode with one self-refinement cycle. We apply the perturbations to the building blocks connected to the most important finding according to the extracted research logic to ensure high impact on the overall paper. Each operation runs independently, yielding three soundness-critical counterfactuals per paper, each creating a distinct type of issue. After modifying the research logic, we apply a set of zero-shot prompts with self-refinement to propagate the edits to the paper text, ensuring fluency and plausibility (C). The LLM considers the revised research logic and the associated textual spans to propose edits. As we focus on unimodal textual ARGs, edits happen only on textual mentions and captions of figures, not the figures themselves. For tables, the LLM uses a chain-of-thought prompt to reason about its structure and content before modification Fang et al. (2024). Finally, an LLM-as-a-judge Zheng et al. (2023) evaluates whether each counterfactual meets the desiderata. If rejected, the LLM generates a new counterfactual targeting the next top-ranked finding.
Soundness-neutral Edits
We apply four simple edits that preserve both content and soundness. We randomly select a subset of paragraphs for each edit type. For active-to-passive, the LLM rewrites text from active to passive voice and vice versa. For American-to-British, it converts spelling between American and British English. For language error, we inject minor spelling and grammar mistakes. For paper layout, we multiply whitespaces at random and relocate figure captions and tables to the end of the paper, altering layout but not content.
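Two of the mechanics above, choosing a random subset of paragraphs and multiplying whitespace at random, need no LLM and can be sketched directly. The function names and the parameters `p` and `max_repeat` are assumptions for illustration; the text does not specify how often or by how much whitespace is multiplied.

```python
import random
import re


def sample_paragraphs(paragraphs: list[str], fraction: float,
                      rng: random.Random) -> list[int]:
    """Pick the indices of a random subset of paragraphs to edit."""
    k = round(len(paragraphs) * fraction)
    return sorted(rng.sample(range(len(paragraphs)), k))


def multiply_whitespace(text: str, rng: random.Random,
                        p: float = 0.2, max_repeat: int = 3) -> str:
    """Repeat each space with probability `p`; words are left untouched."""
    def widen(match: re.Match) -> str:
        if rng.random() < p:
            return match.group(0) * rng.randint(2, max_repeat)
        return match.group(0)

    return re.sub(r" ", widen, text)


# Usage: the edit changes layout only, so the word sequence is unchanged.
rng = random.Random(0)
paragraph = "We evaluate the model on three benchmarks and report accuracy."
edited = multiply_whitespace(paragraph, rng)
print(edited.split() == paragraph.split())
```

Checking that the token sequence is identical before and after the edit is one simple way to confirm that a layout edit did not alter content.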
4.3 Review Comparison
The goal of the third step (see Figure 2) is to quantify differences between reviews of the original and counterfactual papers. Faulty research logic should lead to reviews that are, on average, more negative and more focused on soundness. We approximate these concepts using three review features, computing the difference per feature between the counterfactual and the original review as the review difference. First, we analyze the distribution of aspects, i.e. the paper dimensions that the review discusses, such as experiments or presentation. We detect aspects using a RoBERTa model fine-tuned by Lu et al. (2025), and compute the number of research-logic-related aspects. We compute the review difference as $\Delta_{\text{aspect}} = n_{\text{CF}} - n_{\text{orig}}$, the number of such aspects in the counterfactual review minus the number in the original review. A large positive value indicates an increased focus on soundness as this means that soundness-related comments were added to the review. Second, we assess the sentiment of review assertions. Reviews consist of positive, negative, or neutral assertions on the paper Dycke et al. (2023b). We extract and classify these assertions by their sentiment using GPT-4o-mini in zero-shot mode with self-refinement. We compute the review difference as in . Values closer to mean stronger criticism related to soundness. Third, aligning with Li et al. (2025), we track changes in the review score summarizing the overall reviewer opinion. We compute $\Delta_{\text{score}} = s_{\text{CF}} - s_{\text{orig}}$, the score of the counterfactual review minus the score of the original review. We expect the score to drop when soundness is compromised as indicated by a negative value.
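For the two features whose difference is a plain subtraction, the aspect count and the review score, the per-review computation can be sketched as follows. The class and field names are hypothetical, and the sentiment feature is omitted because its exact formula is not recoverable from the text here.

```python
from dataclasses import dataclass


@dataclass
class ReviewFeatures:
    """Numeric features extracted from one review (hypothetical container)."""
    soundness_aspects: int  # number of research-logic-related aspects
    score: float            # overall review score


def review_difference(original: ReviewFeatures,
                      counterfactual: ReviewFeatures) -> dict[str, float]:
    """Counterfactual minus original, per feature."""
    return {
        # Positive: soundness-related comments were added to the review.
        "aspects": counterfactual.soundness_aspects - original.soundness_aspects,
        # Negative: the score dropped after the research logic was compromised.
        "score": counterfactual.score - original.score,
    }


# Usage with made-up feature values.
diff = review_difference(ReviewFeatures(2, 6.0), ReviewFeatures(4, 5.0))
print(diff)
```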
5 Dataset
Applying our pipeline, we first create a dataset of papers, associated research logic, and counterfactual versions, and validate each step. We list the model versions in Appendix A.3 and prompts in the supplementary materials. We develop prompts for each step individually and manually verify outputs on a subset of papers. We begin with a basic prompt and iteratively refine it using Claude Sonnet 3.5 Anthropic (2024) akin to meta-prompting Zhou et al. (2023). We adjust prompts until we consider the outputs correct for all test samples. All prompts follow a similar structure and use JSON as the output format in line with prompt engineering best practices Phoenix and Taylor (2024) (see Figure 6 in the Appendix).
5.1 Source Data
We construct our dataset from four major AI and NLP conferences over two years to generalize beyond individual venues Kuznetsov et al. (2024). Specifically, we include papers from the Association for Computational Linguistics (ACL) conferences ACL-23 and ACL-24, EMNLP-23 and EMNLP-24, from the Neural Information Processing Systems conference NeurIPS-24, and the International Conference on Learning Representations ICLR-25. To ensure high-quality input papers with valid research logic, we include only accepted papers. We collect openly licensed papers from ICLR-25, NeurIPS-24, and EMNLP-23 via OpenReview (https://openreview.net), and from the ACL Anthology (https://aclanthology.org/) for the remaining ACL conferences, yielding approximately instances. We convert papers to Markdown format suitable for LLM processing. We remove instances with parsing issues retaining reliably parsed papers. From these, we sample papers evenly across conferences: for development and prompt tuning, and for testing. This sample size balances paper diversity with the computational cost of review generation for original and counterfactual papers later. Appendix A details the preprocessing procedure.
5.2 Research Logic Data
Through manual refinement, we develop the prompts for extracting the research logic on the development set and select the best performing model. GPT-4o-mini OpenAI et al. (2024) offered the best trade-off between inference speed, cost and output quality. We employ self-refinement Madaan et al. (2023), which notably improved the alignment between building blocks, consistent with findings on other reasoning-intensive tasks Sahoo et al. (2024). A single refinement cycle achieved the best performance–cost trade-off. We query the OpenAI API (https://platform.openai.com/) for model inference. On average, the extraction of the research logic takes roughly five minutes per paper.
Validation Study
We validate the accuracy of research logic extraction through human evaluation focusing on the factual alignment between each extracted building block and the source paper. We recruit three postgraduate NLP researchers with at least four years of experience reading academic papers. For each annotation, we provide the paper, the building block type, its summary, and the main passage provided by the LLM. Annotators assess the factual correctness of the summary, focusing on the main highlighted passage, but they are allowed to consult the full paper. We also ask them to note issues beyond factual accuracy. We pilot the process on randomly sampled building blocks from papers. We then compute inter-annotator agreement (IAA) on this pilot plus additional samples. In total, annotators label building blocks from papers across conferences. We use Goldstein’s S Bennett et al. (1954) to measure IAA, as it is more suited for imbalanced label distributions than Krippendorff’s $\alpha$ Feinstein and Cicchetti (1990), given that of labels are positive. We obtain , indicating moderate agreement. Upon inspection, we find that in of all samples the LLM selects an incorrect passage to support its summary; in of disagreement cases, at least one annotator flags this issue. This induces disagreement because annotators consider varying amounts of context.
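The agreement measure can be made concrete with a short sketch. It assumes the standard definition of the S coefficient of Bennett et al. (1954), $S = (k \cdot P_o - 1) / (k - 1)$ for $k$ label categories and observed agreement $P_o$, and it assumes that agreement among three annotators is computed as average pairwise agreement; the function names and the example labels are illustrative only.

```python
from itertools import combinations


def observed_agreement(items: list[list[str]]) -> float:
    """Share of agreeing annotator pairs, pooled over all items.

    `items[i]` holds the labels that the annotators gave to item i.
    """
    agree = 0
    total = 0
    for labels in items:
        for a, b in combinations(labels, 2):
            agree += int(a == b)
            total += 1
    return agree / total


def s_coefficient(p_o: float, k: int) -> float:
    """S coefficient: chance agreement is fixed at 1/k for k categories."""
    return (k * p_o - 1) / (k - 1)


# Usage with made-up binary labels from three annotators on four items.
labels = [
    ["correct", "correct", "correct"],
    ["correct", "correct", "incorrect"],
    ["correct", "correct", "correct"],
    ["incorrect", "incorrect", "incorrect"],
]
p_o = observed_agreement(labels)
print(p_o, s_coefficient(p_o, k=2))
```

Because the chance term depends only on the number of categories and not on the observed label frequencies, the coefficient does not collapse when almost all labels fall into one class, which is the property the text relies on.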
Validation Result
We compute the majority vote for redundant annotations and merge them with individually labeled instances, finding that of building blocks are factually accurate. We consider this accuracy sufficient as an intermediate step in counterfactual generation. To address the issue of incorrect evidence spans, we revise the extraction prompts to require the LLM to cite specific text spans together with their paragraph IDs and apply the revised prompts throughout the remainder of the study.
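The label aggregation described above can be sketched in a few lines. The function names and the boolean label encoding are assumptions; with three annotators and binary labels a majority always exists, so ties are not handled here.

```python
from collections import Counter


def majority_vote(labels: list[bool]) -> bool:
    """Most frequent label among the redundant annotations of one item."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(redundant: list[list[bool]], single: list[bool]) -> float:
    """Share of items judged accurate after merging both annotation sets."""
    merged = [majority_vote(labels) for labels in redundant] + list(single)
    return sum(merged) / len(merged)


# Usage with made-up annotations: two triple-annotated items, three single ones.
print(accuracy([[True, True, False], [False, False, True]], [True, True, True]))
```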
5.3 Counterfactual Data
Following our general prompt development paradigm, we tune prompts and select the best LLM based on manual inspection on the development set.
Soundness-neutral Counterfactuals
We generate soundness-neutral counterfactuals using Phi-4 14B Abdin et al. (2024). We run Phi-4 (with Q4_K_M quantization) on two L40 GPUs with approximately 14GB effective memory use during inference. On average, the generation of one soundness-neutral counterfactual takes roughly three minutes. For active-to-passive and American-to-British, we randomly select 40% of the paragraphs; for language error, we select 20% to simulate minor language issues. Given the simplicity of these edits, one author manually inspects counterfactuals per type to verify that the edits do not affect the paper content. In all cases, the revised text preserves the original meaning.
Soundness-critical Counterfactuals
We use GPT-4o-mini to generate soundness-critical counterfactuals. During prompt development, we found that decomposing the process into multiple stages, where each corresponds to a single LLM call producing an intermediate result with an explanation, substantially improved output quality. We explicitly separate stages that require reasoning from those that modify the paper’s content since this produced more diverse outputs. We further divide creative stages, such as proposing new hypothetical results (see Sec. 4.2), into candidate generation and subsequent selection. If an edit type is not applicable, e.g., a paper lacks the necessary claim types for finding edits, we skip the counterfactual for that paper. The generation of one soundness-critical counterfactual takes on average roughly two minutes using the OpenAI API.
Validation Study
The human validation assesses whether counterfactuals meet the desiderata (see Section 4.2). We employ the same three postgraduate NLP researchers as in the previous validation. We provide annotators with the counterfactual paper, the extracted research logic, the edited building block, and highlight the edits in the paper. First, annotators judge if the research logic is compromised (RL-cor) and plausible within the paper’s scope (RL-pla). Then, they evaluate the edits applied to the paper considering whether the edits affect the soundness (E-cor), are plausible (E-pla), and minimal (E-min). We pilot the annotation on counterfactuals. For the main study, annotators assess counterfactuals from papers, including used to compute IAA. Due to label imbalance, we report Goldstein’s S: for RL-cor, for RL-pla, for E-cor, for E-pla, and for E-min. Annotators show moderate to substantial agreement on the research logic modifications. In contrast, assessing the correctness of paper edits proves more subjective, particularly for E-cor; this is consistent with levels of subjectivity in peer review Bornmann (2011). Disagreements occur more frequently for edits of conclusions () and findings (), where annotator comments indicate that certain edits, such as inserting the word ’significant’ as a statistical claim, are ambiguous, leading to disagreement.
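For readers unfamiliar with the S coefficient of Bennett et al. (1954), the following sketch shows the standard two-annotator form, S = (k · P_o − 1) / (k − 1), where P_o is the observed agreement and k the number of label categories. How the score is aggregated over three annotators in this study is not stated here, so the pairwise setting and the function name `bennett_s` are assumptions.

```python
def bennett_s(labels_a, labels_b, num_categories=2):
    """Bennett, Alpert, and Goldstein's S for two annotators.

    S corrects observed agreement for chance under the assumption that all
    categories are equally likely, which makes it robust to label imbalance.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    if num_categories < 2:
        raise ValueError("need at least two categories")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    observed = matches / len(labels_a)
    return (num_categories * observed - 1) / (num_categories - 1)


if __name__ == "__main__":
    # Illustrative labels only (1 = criterion met, 0 = not met).
    annotator_1 = [1, 1, 1, 1, 0]
    annotator_2 = [1, 1, 1, 0, 0]
    print(bennett_s(annotator_1, annotator_2))
```

Unlike Cohen's kappa, the chance term here does not depend on the observed label distribution, which is why S remains informative when almost all instances receive the same label.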
Validation Result
We use the majority vote on redundant annotations merged with individually labelled instances. Annotators judge of research logic edits as correct (RL-cor) and as plausible (RL-pla). They find that of the edits compromise the paper’s soundness (E-cor), are plausible (E-pla), and are minimal (E-min). For the purpose of our evaluation, we consider the desiderata met and account for residual noise during evaluation by comparing soundness-neutral and soundness-critical counterfactuals on a large set of instances.
5.4 Dataset Overview
Tables 2 and 3 summarize the final evaluation dataset of counterfactuals on papers, excluding papers without empirical findings.
Diversity of Counterfactuals
As shown in Table 3, the number of edits and amount of changed text is comparable for both counterfactual types, ensuring fair comparison. Among soundness-critical edits, of them appear in the text, in tables, and in figure captions; the latter two only occur for result edits. The dominance of text edits reflects the textual nature of findings and conclusions, and the fact that many results are reported and interpreted in the text. Each edit type engages distinct reasoning over different areas of the paper.
Diversity of Papers
As shown in Table 2, the papers are nearly evenly distributed across conferences. To assess authorship diversity, we use the last author’s affiliated institution as a proxy, retrieving metadata via the OpenAlex API (https://openalex.org/), excluding missed authors. On average, papers originate from the same institution, indicating high diversity. Regarding research logic, of the paper types classified during extraction, papers introducing new methodology dominate (), followed by analysis () and dataset papers (). This skew towards methodological work likely reflects the actual distribution of publications in AI and NLP conferences and encourages future work on more diverse paper types (see Sec. 7).
6 Evaluation
We run the evaluation pipeline based on the previously generated dataset to estimate the average treatment effect of soundness-neutral and -critical edits and compare them. The detailed model versions, generation hyperparameters, and prompts are reported in Appendix B.
6.1 Experiments
ARGs
We consider three types of ARGs from the literature. The first type uses LLMs in zero-shot mode either with a generic review prompt (Zero-Generic), e.g. in Liang et al. (2024b), or a prompt including venue-specific guidelines (Zero-Guide), e.g. in Du et al. (2024). We test both with GPT-4o-mini and GPT-4.1 as large proprietary LLMs, DeepSeek-14B and DeepSeekV3 DeepSeek-AI et al. (2024) as reasoning LLMs, and Phi-4 as a small open-weight model. We design the prompts based on related literature to ensure alignment, while adding a standardized output format for easy parsing. We perform formatting and plausibility checks on the development set without further tuning. The second type comprises multi-agent systems where LLMs specialize in different paper aspects and engage in a discussion. We include TreeReviewer Chang et al. (2025), implemented with GPT-4o-mini, as a representative. MARG D’Arcy et al. (2024), evaluated on our development set, proved computationally infeasible with individual reviews requiring at least minutes and up to hours. The third type uses LLMs fine-tuned on peer review data. We test Reviewer2 Gao et al. (2024), fine-tuned with an automatic prompting model, and DeepReviewer Zhu et al. (2025), which is further trained on synthetic reasoning data. We use default hyperparameters for all ARGs. We fix the random seed and, for zero-shot LLMs, set the temperature to zero to minimize output variance. We truncate the paper to the effective context window size per LLM Hsieh et al. (2024). When ARGs produce semi-structured text, e.g. DeepReviewer, we parse the output using regular expressions and, if needed, fall back to GPT-4o-mini for parsing. We include reviews that cannot be parsed into the venue’s template in raw form; if the parsing of scores fails, we do not consider those scores for evaluation.
Oracle Ablation
To validate the evaluation pipeline, we include an oracle ARG. The oracle takes the reviews generated by Zero-Guide-GPT4om for the original papers and paraphrases them for each counterfactual. For soundness-critical counterfactuals, it also adds a comment on the introduced issue and lowers the overall score randomly. This setup simulates an ARG that reports the soundness issue in the review and adjusts the overall score as a mild but notable reaction to unsound research logic.
Statistical Analysis
For each review difference dimension, we fit a linear mixed-effects model (LME) Lindstrom and Bates (1988) per ARG. We test the null hypothesis that the ATE of the two conditions, soundness-critical or soundness-neutral, is identical; in other words, we test if soundness-critical edits have no stronger effect on reviews than the soundness-neutral edits. LMEs are designed for varying repeated measures. In our case, there are multiple review differences per paper; we model the paper as a random effect, while the review difference and condition are fixed effects. Finally, we use the Benjamini-Hochberg correction Benjamini and Hochberg (1995) due to multiple testing on ARGs.
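The multiple-testing step can be illustrated with a small, dependency-free sketch of the Benjamini-Hochberg step-up procedure. It covers only the correction, not the LME fit; the function name `benjamini_hochberg`, the example p-values, and the level passed as `alpha` are illustrative assumptions rather than values from this study.

```python
def benjamini_hochberg(p_values, alpha):
    """Return BH-adjusted p-values and rejection decisions at level alpha.

    The adjusted value for the p-value of rank r (ascending, 1-based) among
    m tests is the minimum of p * m / r over all ranks >= r, capped at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the running minimum enforces
    # monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    rejected = [p <= alpha for p in adjusted]
    return adjusted, rejected


if __name__ == "__main__":
    # Hypothetical p-values, one per ARG, for a single review dimension.
    raw = [0.01, 0.04, 0.03, 0.005]
    adj, rej = benjamini_hochberg(raw, alpha=0.05)
    print(adj, rej)
```

In practice each raw p-value would come from the condition coefficient of the per-ARG mixed model, and the correction would be applied across ARGs.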
6.2 Results
Are automatic review reports significantly affected by soundness-critical edits?
Figures 3, 4, and 5 show the box plots of the ATE per difference dimension and both counterfactual types per ARG. To recap, ’Aspect’ (Figure 3) measures the number of soundness-related statements, ’Sentiment’ (Figure 4) the density of positive comments, and ’Score’ (Figure 5) the overall rating; all of them compare the ARG’s reviews for the counterfactuals with the one for the original (see Section 4.3). For an ideal ARG, the ATEs for soundness-critical and -neutral edits should lie far apart on all dimensions; the ATE of critical edits should be positive for aspects (soundness-related aspects are added) and negative for sentiment (fewer positive comments) and score (lower rating).
As expected, the ATEs of soundness-neutral edits are close to zero for most ARGs. The variance is large for all ARGs and especially for the fine-tuned Reviewer2 and DeepReviewer. However, for all ARGs the ATE of soundness-critical edits is similar to the ATE of neutral edits. At a significance level of , none of the models show a significant difference in the ATE between neutral and critical edits for aspects, sentiment, or score (detailed p-values are reported in Appendix Table 5). In other words, faulty research logic has no statistically significant impact on the automatically generated reviews of any tested ARG.
Which of the ARGs are influenced most by faulty research logic?
Although none of the ARGs are significantly affected by flawed research logic, we rank the models to identify tendencies. A coherent ARG should show low ATE variance for soundness-neutral counterfactuals, being resilient to surface-level changes, and a large difference between the ATE for neutral and critical counterfactuals. To capture this in a single score, we express the ATE difference between both conditions as a multiple of the effect standard deviation of the neutral counterfactuals. For each ARG, we compute per dimension and then average over all dimensions. This yields the ranking shown in Table 4. Notably, generic zero-shot prompts perform better than those with guidance, which may act as a distractor. For Phi-4, with its smaller context window, Zero-Guide likely causes truncation of key paper content. While the fine-tuned Reviewer2 ranks highest, DeepReviewer ranks lowest; likely, their backbone models, Llama-2 for Reviewer2 and Phi-4 for DeepReviewer, play an important role since the Zero-Guide results for Phi-4 suggest issues with context size and distractors. This warrants further investigation in future work. Finally, there is no clear link between model size and performance, as Phi-4 and GPT-4.1 both rank highly.
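One plausible reading of this ranking score is sketched below: per dimension, the difference between the critical and neutral ATE is divided by the standard deviation of the neutral effects, and the per-dimension values are averaged. The exact expression is not reproduced in the text above, so the sign convention (`expected_sign`, which orients each dimension so that larger is better), the use of the sample standard deviation, and all function names are assumptions.

```python
from statistics import mean, stdev


def separation_score(neutral_diffs, critical_diffs, expected_sign=1):
    """ATE gap between conditions in units of the neutral standard deviation.

    Each list holds per-instance review differences (counterfactual review
    minus original review) for one ARG and one dimension.
    """
    sd = stdev(neutral_diffs)  # sample standard deviation (assumption)
    if sd == 0:
        raise ValueError("neutral effects have zero variance")
    gap = mean(critical_diffs) - mean(neutral_diffs)
    return expected_sign * gap / sd


def rank_score(per_dimension):
    """Average the separation score over all dimensions of one ARG.

    per_dimension maps a dimension name to a tuple
    (neutral_diffs, critical_diffs, expected_sign).
    """
    return mean(
        separation_score(neutral, critical, sign)
        for neutral, critical, sign in per_dimension.values()
    )


if __name__ == "__main__":
    # Illustrative numbers only, not results from the study.
    example = {
        "aspect": ([0.0, 1.0, -1.0], [2.0, 3.0, 1.0], 1),
        "score": ([0.0, 1.0, -1.0], [-2.0, -3.0, -1.0], -1),
    }
    print(rank_score(example))
```

Normalizing by the neutral standard deviation rewards ARGs that are both stable under surface-level edits and responsive to soundness-critical ones.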
How does an ideal ARG perform?
To validate the evaluation pipeline, we evaluate the oracle ARG along with the others. Here, for all three dimensions there is a notable difference between critical and neutral counterfactuals, which is confirmed by statistical analysis. The oracle reviews are significantly influenced (at ) by faulty research logic along all dimensions, thereby confirming the validity of our pipeline.
How do these results contextualize with prior work?
Our results suggest that prior reports of ARGs achieving error discovery rates of around for GPT-4o-mini Zhang and Abernethy (2025) are influenced by the types of errors tested, which often rely on background knowledge recall, a task at which LLMs perform well AlKhamissi et al. (2022). For example, spotting an error in a softmax formula mainly tests recall with minimal reasoning. In contrast, our experiments isolate reasoning and show that ARGs clearly lack this skill. From this, we draw two recommendations for future work. First, to accurately evaluate ARGs, studies should test distinct skills in isolation to identify specific limitations and capabilities. Second, our findings highlight the potential for human–LLM collaboration in peer review. Our results show that ARGs alone fail to assess the consistency of research logic, whereas human reviewers benefit from AI assistance in knowledge-intensive steps Dycke et al. (2025). This suggests that peer review is well-suited to human–AI collaboration through co-construction Dutta et al. (2025), where a human expert and an AI system jointly construct a review through iterative interaction, compensating for each other’s limitations.
How sensitive are ARGs to the paper’s surface form?
Notably, all ARGs show high standard deviations for soundness-neutral counterfactuals, suggesting that ARGs are sensitive to spurious features similar to issues of prompt sensitivity in LLMs Sclar et al. (2024). To examine this further, we measure lexical similarity using ROUGE-2 and use GPT-4o-mini to detect whether assertions align between the original and soundness-neutral reviews, computing the Jaccard index on assertions. We perform human validation of the automatic alignment with two annotators on assertion pairs with substantial agreement (Krippendorff’s ) attesting an accuracy of . We analyze randomly sampled papers (roughly 8 per conference). Table 6 in the Appendix reports the detailed results. Surface similarity ranges from ROUGE-2 for DeepReviewer to for Zero-Guide-GPT4om, indicating substantial changes in wording for many ARGs. This pattern holds for the assertion overlap, with a Jaccard index from for DeepReviewer to for Zero-Generic-GPT4om indicating low to moderate overlap in the contents of the reviews. Overall, GPT-4o-mini produces more similar reviews than other ARGs, regardless of the prompting strategy. Even allowing for some measurement error, ARGs remain highly sensitive to spurious surface changes in the input text. Our findings point to another recommendation for future work: all ARG evaluations, whether using human reference data or sensitivity analysis, should report mean performance on multiple soundness-neutral versions of the papers akin to prompt-sensitivity reporting to better reflect their true capability independent of spurious features. Here, our counterfactual dataset may also help improve model consistency through training data augmentation Wu et al. (2021).
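The two similarity measures used in this analysis can be sketched as follows. The ROUGE-2 function is a simplified bigram F1 on whitespace tokens, not the official ROUGE implementation, and the Jaccard function assumes the assertion alignment (done with GPT-4o-mini in the study) has already produced a count of matched assertions; both function names are hypothetical.

```python
from collections import Counter


def rouge2_f1(reference, candidate):
    """Simplified ROUGE-2 F1: clipped bigram overlap on lowercased tokens."""

    def bigrams(text):
        tokens = text.lower().split()
        return Counter(zip(tokens, tokens[1:]))

    ref, cand = bigrams(reference), bigrams(candidate)
    overlap = sum((ref & cand).values())  # multiset intersection clips counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def jaccard_from_alignment(n_original, n_counterfactual, n_aligned):
    """Jaccard index of two assertion sets given the number of aligned pairs."""
    union = n_original + n_counterfactual - n_aligned
    return n_aligned / union if union else 1.0


if __name__ == "__main__":
    print(rouge2_f1("the method is sound", "the method is weak"))
    print(jaccard_from_alignment(5, 4, 3))
```

A low ROUGE-2 value with a high Jaccard index would indicate rewording without a change in content, whereas low values on both point to reviews that differ in substance.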
7 Limitations and Future Work
In this section, we summarize potential limitations of the proposed framework and point to important future work.
Assumptions
We design the counterfactual generation to be generic and applicable across diverse papers. However, our approach makes assumptions on the paper structure that may not hold for all paper types and domains. Future work should extend faulty research logic detection to additional domains, paper types, and logical structures. For evaluation, we use accepted papers from reputable, peer-reviewed venues to ensure the original research logic is sound and extractable. While this assumption may not hold for every paper, averaging effects over a large, diverse sample helps manage noise. Future extensions to new data crucially need quality assurance of the underlying papers.
Counterfactual Generation
For counterfactual generation, we validate the plausibility and diversity of edits aiming to resemble human research logic errors. However, the distribution of constructed errors likely differs from those in real papers. Our evaluation tests ARGs’ canonical ability to detect flawed research logic under controlled conditions. If an ARG’s output is unaffected by flawed logic in this setting, it will also fail on real papers. Conversely, a significant effect does not guarantee strong real-world performance underscoring the need to study research logic flaws in human papers. Finally, we focus on unimodal ARGs, as editing figures poses a separate challenge and requires distinct reviewing skills Huang et al. (2025). Instead, we modify figure captions and related text to simulate figure edits from a text-only reviewer’s perspective. Future research should explore counterfactual generation for figures and test multimodal ARGs.
Interpretation
Our experiments show that state-of-the-art ARGs fail to identify flawed research logic in research papers. We further find that ARGs are sensitive to surface-level features, suggesting that the underlying LLMs rely on superficial heuristics instead of thorough reasoning. Investigating the causes of this behavior is an important direction for future work. In particular, distinguishing failures to identify soundness-relevant information in the paper from failures to recognize logical fallacies within the research logic itself is crucial for developing more robust ARGs in the future.
Reproducibility
For dataset creation and evaluation we primarily employ closed commercial models, as their outputs demonstrated higher quality during manual tuning. To enhance replicability, we release all model outputs in the supplementary materials. For reproducibility, detailed model versions are reported in Appendix B.2. However, the use of closed models inevitably limits future reproduction on new data, as such models may undergo unrecorded changes. In future work, extending these steps to open models that perform better than or on par with commercial LLMs is a promising avenue to enhance reproducibility.
8 Conclusion
We introduced a novel, fully automatic counterfactual evaluation framework for ARGs, focusing on the detection of flawed research logic. We proposed a three-step pipeline to estimate the effect of soundness-critical edits to papers on automatically generated peer reviews, testing the reasoning capabilities of ARGs. Our results show that current ARGs fail to detect faulty research logic. Based on this, we propose three directions to advance LLM-based reviewing systems: designing dedicated tests for distinct reviewing skills, fostering human–machine collaboration, and accounting for sensitivity to surface-level paper features during evaluation. Our work lays the groundwork for robust evaluation of ARGs.
Acknowledgements
This work has been funded by the LOEWE Distinguished Chair “Ubiquitous Knowledge Processing”, LOEWE initiative, Hesse, Germany (Grant Number: LOEWE/4a//519/05/00.002(0002)/81). This work has been co-funded by the German Federal Ministry of Education and Research and the Hessian Ministry of Higher Education, Research, Science, and the Arts within their joint support of the National Research Center for Applied Cybersecurity ATHENE. This research work has been co-funded by the European Union (ERC, InterText, 101054961). Views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them. We express our sincere gratitude to the reviewers and action editor at TACL for their valuable and constructive feedback.
Appendix A Dataset Preprocessing
A.1 Underlying Dataset Preprocessing
Since the counterfactual generation aims to make surgical edits of few tokens, the parsing accuracy of papers needs to be high, with tables being particularly challenging. We therefore cross-match the accepted papers with ar5iv (https://ar5iv.labs.arxiv.org/, last accessed on 20/07/25), which provides HTML versions of the papers constructed from the LaTeX source, using title and author for retrieval. Roughly one quarter of the papers across all conferences can be matched exactly. We convert those original papers to markdown using their HTML version as a reference. Subsequently, we filter out papers for which the title or abstract could not be identified or where the number of sections lies below three, suggesting a parsing error.
A.2 Research Logic Extraction
A.2.1 Pseudo-code
We report the pseudo-code for extracting the research logic of a scientific paper in Algorithm 1. Due to space restrictions, we refer the reader to the accompanying code of this paper for the detailed prompts used during each individual step.
A.2.2 Hyper-parameters
For the research logic extraction we query the OpenAI API (https://openai.com/, last accessed on 20/07/25) between May and June 2025. We use GPT-4o-mini in version 2024-07-18. For generation, we use the default hyper-parameters; i.e. a temperature of 1.0 and no nucleus sampling.
A.3 Counterfactual Generation
A.3.1 Prompts
The generation of counterfactuals involves multiple steps and prompts including self-refinement. Due to space restrictions, we report on the exact prompts used for counterfactual generation in the supplementary code. Figure 6 summarizes the general prompt structure which is common to all steps in the pipeline.
A.3.2 Hyper-parameters
Soundness-critical Counterfactuals
For the soundness-critical counterfactuals, we use GPT-4o-mini with the same configuration as before; i.e. version 2024-07-18 with a temperature of 1.0 queried in May and June 2025.
Soundness-neutral Counterfactuals
For the soundness-neutral counterfactuals, we use Phi-4 14B run with a temperature of 0.8 and top-40 sampling. We run the ollama version (https://ollama.com/library/phi4, last accessed on 20/07/2025).
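As a rough illustration of these settings, the sketch below builds a request body for Ollama's `/api/generate` endpoint. It assumes that "top-40 sampling" corresponds to Ollama's `top_k` option and that the default `phi4` tag is used; the helper name `build_phi4_request` and the example prompt are hypothetical, and the actual prompts are those in the supplementary code.

```python
import json


def build_phi4_request(prompt):
    """Build a JSON-serializable body for a non-streaming Ollama generation."""
    return {
        "model": "phi4",  # assumption: default tag of the ollama phi4 library page
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": 0.8,
            "top_k": 40,  # assumption: "top-40 sampling" read as top-k with k = 40
        },
    }


if __name__ == "__main__":
    body = build_phi4_request("Rewrite the paragraph in British English: ...")
    print(json.dumps(body, indent=2))
```

The body would be sent as a POST request to a locally running Ollama server; the request itself is omitted here to keep the sketch free of network access.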
Appendix B Experiments
B.1 ARG Zeroshot Prompts
Figure 7 shows the prompt used for the Zero-Generic approach. Figure 8 shows the prompt used for the Zero-Guide approach. The key difference is a brief description of the reviewing approach provided as a template parameter based on publicly available reviewer guidelines of each venue.
B.2 ARG Hyper-parameters
For all ARGs we discard the paper’s appendix since this frequently led to errors due to the limited context window of the models. If a paper exceeds the context window of an LLM according to the recommended ranges of best performance Hsieh et al. (2024), we truncate the paper at the end.
For Zero-Generic-GPT4om, Zero-Guide-GPT4om, MARG, TreeReviewer, and for fall-back parsing of the fine-tuned ARG outputs we use GPT-4o-mini version 2024-07-18 with a temperature of 0 queried in May to July 2025. For Zero-Generic-GPT4.1 we use version gpt-4_1-2025-04-14 with a temperature of 0 queried in July 2025. For the DeepSeekV3-based zero-shot ARGs we query the DeepSeek API (https://api-docs.deepseek.com) using version DeepSeek-V3-0324 as a chat model with temperature 0 queried in July 2025. We adapt the publicly available code of MARG (https://github.com/allenai/marg-reviewer/tree/master, last accessed on 18.07.25) to use GPT-4o-mini. Likewise, we adapt the public code of TreeReviewer (https://github.com/YuanChang98/tree-review) to use GPT-4o-mini instead of Gemini-2.0-flash as in the original paper, to ensure better comparability of approaches irrespective of the underlying LLM.
For the fine-tuned models Reviewer2 and DeepReviewer, we use the default hyperparameters provided by the authors; we do not set the temperature to 0 since this resulted in frequent output errors (repeated tokens etc.) that could not be parsed into a review report. Reviewer2 uses two fine-tuned models (https://huggingface.co/GitBag/Reviewer2_Mr and https://huggingface.co/GitBag/Reviewer2_Mp, last accessed on 18.07.25) based on Llama-2-7B-Chat Touvron et al. (2023) with a temperature of . DeepReviewer is based on Phi-4-14B Abdin et al. (2024) (https://huggingface.co/WestlakeNLP/DeepReviewer-14B) with a temperature of .
Appendix C Complementary Results
C.1 Detailed Statistical Testing Results
Table 5 reports the detailed statistical test results.
C.2 Sensitivity to Paper Surface Form
Table 6 reports the ROUGE-2 and Jaccard overlap of the assertions in the pairs of original and counterfactual reviews.
References
- Abdin et al. (2024) Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, Mojan Javaheripi, Piero Kauffmann, James R. Lee, Yin Tat Lee, Yuanzhi Li, Weishung Liu, Caio C. T. Mendes, Anh Nguyen, Eric Price, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Xin Wang, Rachel Ward, Yue Wu, Dingli Yu, Cyril Zhang, and Yi Zhang. 2024. Phi-4 technical report. arXiv preprint arXiv:2412.08905v1.
- AlKhamissi et al. (2022) Badr AlKhamissi, Millicent Li, Asli Celikyilmaz, Mona Diab, and Marjan Ghazvininejad. 2022. A review on language models as knowledge bases. arXiv preprint arXiv:2204.06031v1.
- Anthropic (2024) Anthropic. 2024. Claude 3.5 sonnet. https://www.anthropic.com/news/claude-3-5-sonnet. Accessed: 2025-11-05.
- Armstrong and Green (2022) John Scott Armstrong and Kesten C. Green. 2022. The Scientific Method: A Guide to Finding Useful Knowledge. Cambridge University Press.
- Association for the Advancement of Artificial Intelligence (2025) Association for the Advancement of Artificial Intelligence. 2025. AAAI launches AI-powered peer review assessment system. https://aaai.org/aaai-launches-ai-powered-peer-review-assessment-system/. Accessed: 2025-07-18.
- Bacon (1878) Francis Bacon. 1878. Novum Organum. Clarendon Press.
- Benjamini and Hochberg (1995) Yoav Benjamini and Yosef Hochberg. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300.
- Bennett et al. (1954) Edward M. Bennett, Renee Alpert, and A. C. Goldstein. 1954. Communications through limited response questioning. Public Opinion Quarterly, pages 303–308.
