Self-Correction Bench: Uncovering and Addressing the Self-Correction Blind Spot in Large Language Models

Ken Tsui

arXiv:2507.02778·cs.CL·October 7, 2025

Self-Correction Bench: Uncovering and Addressing the Self-Correction Blind Spot in Large Language Models

Ken Tsui

PDF

Open Access 4 Datasets 3 Reviews

TL;DR

This paper identifies a systematic failure in large language models where they cannot correct their own errors, introduces an evaluation framework to measure this issue, and proposes a simple prompt modification to significantly reduce the problem.

Contribution

The paper uncovers the Self-Correction Blind Spot in LLMs, introduces Self-Correction Bench for evaluation, and demonstrates a prompt-based method to mitigate the issue.

Findings

01

Average 64.5% blind spot rate across models

02

Appending a 'Wait' prompt reduces blind spots by 89.3%

03

Training data influences error correction capabilities

Abstract

Although large language models (LLMs) have transformed AI, they still make mistakes and can explore unproductive reasoning paths. Self-correction capability is essential for deploying LLMs in safety-critical applications. We uncover a systematic failure: LLMs cannot correct errors in their own outputs while successfully correcting identical errors from external sources - a limitation we term the Self-Correction Blind Spot. To study this phenomenon, we introduce Self-Correction Bench, an evaluation framework to measure this phenomenon through controlled error injection at three complexity levels. Testing 14 open-source non-reasoning models, we find an average 64.5% blind spot rate. We provide multiple lines of evidence suggesting this limitation may be influenced by training data: human demonstrations rarely include error-correction sequences (favoring error-free responses), whereas…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01Rating 2Confidence 3

Strengths

There are several strengths to the manuscript: 1. Rather clear isolation of the problem and extensive empirical analysis of multiple models on the behavior. The experiment design seems reasonable, where exact same sequence of tokens is presented but with different attribution to observe differing model behavior. 2. The introduction of a toy benchmark (SCLI5) to explain clearly the phenomena, followed by studying on real-world benchmarks. 3. Identifying an intervention, e.g. the word "wait".

Weaknesses

I believe the overall contribution is somewhat naive and does not go sufficiently in-depth in understanding the mechanics of the model behavior. The weaknesses I would like to hear the author's opinion are: 1. The entire work is prompt engineering - from the detection to the solution. The authors acknowledge this, but I find it necessary to go a step further and point out some training recipe changes that mitigates this to some degree. Naive fine-tuning with the word 'wait' may not work, while

Reviewer 02Rating 2Confidence 4

Strengths

1. For originality and significance, I think the newly introduced evaluation framework is a nice addition to existing works. 2. The paper is generally understandable, but I need more explanation/analysis described in the weaknesses and questions sections below.

Weaknesses

1. The paper discusses self-correction blind spot on LLMs, but closed-source LLMs are unfortunately not studied. Although it is explained as "close-source models lack support for fine-grained control of prefix inject critical for our methodology" in line 236, I do not think this would stop you from studying closed-source models. You may want to analyze their reasoning chains directly and compare model outputs side-by-side. Otherwise, the findings of this paper is very limited. 2. The analysis o

Reviewer 03Rating 4Confidence 4

Strengths

1. Dataset contribution specifically focused on self-correction (or "correction", see weakness 1) by asking models to determine whether the given answer or reasoning in the prompt needs to be corrected, and indeed make the correction. This allows for the analysis in the paper to examine the correction ability of both reasoning and non-reasoning LLMs. 2. Empirical comparative analysis of the correction capability of reasoning and non-reasoning LLMs reveals that one strong differentiating factor

Weaknesses

1. The dataset does not strictly study self-correction. By the construction description, all 3 sub-datasets were generated by inserting wrong answers or reasoning traces into a given prompt from standard datasets (with possibly a short sequence of model output) using off-the-self closed-source models (such as GPT 4.1). Since the incorrect answers were not generated by the models-under-test (open-source models with fewer parameters), it is unknown how likely they are under each model's own sampli

Code & Models

Datasets

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsTopic Modeling · Natural Language Processing Techniques