Scalable Oversight for Superhuman AI via Recursive Self-Critiquing
Xueru Wen, Jie Lou, Xinyu Lu, Junjie Yang, Yanjiang Liu, Yaojie Lu, Debing Zhang, Xing Yu

TL;DR
This paper proposes a recursive self-critiquing approach to AI oversight, enabling scalable supervision of superhuman AI by leveraging the idea that critiquing critiques is easier than direct evaluation, especially when AI outputs surpass human ability.
Contribution
It introduces the concept of recursive self-critiquing as a novel method for scalable AI oversight, extending verification principles to critique and higher-order critiques.
Findings
Recursive critique improves oversight scalability.
Experiments show effectiveness of AI-AI critique interactions.
Higher-order critiques outperform direct evaluation in complex tasks.
Abstract
As AI capabilities increasingly surpass human proficiency in complex tasks, current alignment techniques, including SFT and RLHF, face fundamental challenges in ensuring reliable oversight. These methods rely on direct human assessment and become impractical when AI outputs exceed human cognitive thresholds. In response to this challenge, we explore two hypotheses: (1) "Critique of critique can be easier than critique itself", extending the widely-accepted observation that verification is easier than generation to the critique domain, as critique itself is a specialized form of generation; (2) "This difficulty relationship holds recursively", suggesting that when direct evaluation is infeasible, performing higher-order critiques (e.g., critique of critique of critique) offers a more tractable supervision pathway. We conduct Human-Human, Human-AI, and AI-AI experiments to…
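The recursive structure described in the abstract (a response, then a critique of it, then a critique of that critique, and so on) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the names `Critic`, `recursive_critique`, and `toy_critic` are hypothetical, and the toy critic stands in for a human or model reviewer.

```python
from typing import Callable, List

# Hypothetical interface (an assumption, not taken from the paper): a critic
# receives the question plus the chain so far (the response followed by any
# earlier critiques) and returns the next, higher-order critique as text.
Critic = Callable[[str, List[str]], str]


def recursive_critique(question: str, response: str,
                       critic: Critic, depth: int) -> List[str]:
    """Build the chain [R, C1, C2, ..., C_depth].

    Each new critique sees the whole chain so far, so C2 can judge C1
    in the context of the original response R.
    """
    chain = [response]
    for _ in range(depth):
        chain.append(critic(question, chain))
    return chain


def toy_critic(question: str, chain: List[str]) -> str:
    """Stand-in critic that only labels which item it is reviewing."""
    order = len(chain)  # 1 when producing C1, 2 when producing C2, ...
    target = "R" if order == 1 else f"C{order - 1}"
    return f"C{order} reviews {target}"


# Example: a response followed by critiques up to third order.
chain = recursive_critique("What is 2 + 2?", "R: 4", toy_critic, depth=3)
for item in chain:
    print(item)
```

In a real setup the critic would be a human annotator or a model call, and a separate step would turn the highest-order critique into a final judgment about the response. That step is left out here because the excerpt above does not specify it.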
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
1. This paper focuses on a very important topic, scalable oversight, and proposes an interesting and insightful assumption: Critique of critique can be easier than critique itself.
2. The authors conducted extensive experiments to verify these assumptions. Particularly, I appreciate the human and human-AI interactions study.
3. On the DeepScaleR dataset, the improvement is significant and consistent.
1. The effectiveness of the proposed method is quite limited. In the main body, only the DeepScaleR dataset is used. Moreover, I found in Appendix C, popular datasets for scalable oversight, e.g., GPQA and MMLU, are actually used and compared, but there is no significant/consistent improvement. This questions the generalization performance of the proposed method.
2. Besides majority voting, the authors didn’t compare any other scalable oversight baselines, nor did they justify the reason. For e
- **Novel, interesting idea:** The idea of extending response critique to critique-critique and recursively onward is interesting and novel.
- **Consistently positive results**: The results are consistently positive for human-human and human-AI recursive critiquing, and provide some interesting results for AI self-supervision, opening up possibilities for bootstrapping stronger models from weaker ones.
- **Diverse datasets:** The paper covers a diverse range of datasets, including Eng
- **Clarity of tables 1 and 2:** If I understand correctly, the C^2 and C^3 result in the majority voting column in each of these tables is just copied from the accuracy column in order to allow a vertical comparison. Rather than doing this, you should just leave that part of that column blank. If I am misunderstanding, then I don’t know what C^2 or C^3 is for the Majority Voting column or why it is identical to the accuracy column.
- **Small(ish) models**: A very minor weakness: the human
1) The problem under study is interesting and important. The paper tackles scalable oversight when direct human evaluation becomes infeasible, articulating the hypothesis that “critique of critique” is easier than critique and exploring its implications for alignment and supervision workflows.
2) The paper is well written and easy to follow. The protocol is clearly specified (R → C1 → C2 → C3) and concise.
3) Both AI and human studies are conducted.
1) The scope and strength of AI evaluations are limited. Human–AI experiments consider Qwen2.5-7B and 72B, and supplemental AI–AI experiments use Gemma2-9B and Qwen2.5-14B; frontier or specialized reasoning models are not included.
2) Possible test-time scaling confounds remain. Although the paper compares against effort-equivalent majority voting, token-level and call-level budgets can still differ across stages and models. I wonder whether the performance lift comes purely from test-time scal
The paper demonstrates several notable strengths:
1. Originality: The extension of the "verification is easier than generation" principle to recursive critique represents a creative and novel approach to the scalable oversight problem. The recursive formulation provides a new pathway for supervision when direct evaluation becomes infeasible.
2. Quality of Experiments: The experimental design is comprehensive and methodical. The progression from Human-Human to Human-AI to AI-AI experiments crea
1. Limited Task Diversity: Although the paper includes five different tasks, they are all in the domain of academic-style problems (language comprehension, mathematics, logical reasoning). The effectiveness of recursive critique in more creative or open-ended domains remains unexplored. Including tasks with more subjective evaluation criteria would strengthen the generalizability of the findings.
2. Scalability Concerns: The paper doesn't adequately address potential scalability issues with rec
Taxonomy
Topics: Reinforcement Learning in Robotics
Methods: Shrink and Fine-Tune
