Understanding the Dark Side of LLMs' Intrinsic Self-Correction
Qingjie Zhang, Di Wang, Haoting Qian, Yiming Li, Tianwei Zhang, Minlie Huang, Ke Xu, Hewu Li, Yan Liu, Han Qiu

TL;DR
This paper investigates the limitations of intrinsic self-correction in large language models, revealing issues like answer wavering and cognitive biases, and proposes strategies to mitigate these problems.
Contribution
The study provides the first detailed interpretation of LLMs' intrinsic self-correction failures across various tasks and models, and offers practical mitigation strategies.
Findings
Intrinsic self-correction can cause wavering in answers and prompt bias on simple questions.
It introduces human-like cognitive biases on complex tasks.
Question repetition and supervised fine-tuning can alleviate these issues.
Abstract
Intrinsic self-correction was proposed to improve LLMs' responses via feedback prompts solely based on their inherent capability. However, recent works show that LLMs' intrinsic self-correction fails without oracle labels as feedback prompts. In this paper, we aim to interpret LLMs' intrinsic self-correction for different tasks, especially for those failure cases. By including one simple task and three complex tasks with state-of-the-art (SOTA) LLMs like ChatGPT families (o1, 4o, 3.5-turbo) and Llama families (2-7B, 3-8B, and 3.1-8B), we design three interpretation methods to reveal the dark side of LLMs' intrinsic self-correction. We identify intrinsic self-correction can (1) cause LLMs to waver both intermedia and final answers and lead to prompt bias on simple factual questions; (2) introduce human-like cognitive bias on complex tasks. In light of our findings, we also provide two…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsOpen Education and E-Learning · Library Science and Information Systems
MethodsLLaMA
