SSR: Socratic Self-Refine for Large Language Model Reasoning
Haizhou Shi, Ye Liu, Bo Pang, Zeyu Leo Liu, Hao Wang, Silvio Savarese, Caiming Xiong, Yingbo Zhou, Semih Yavuz

TL;DR
SSR introduces a fine-grained, step-level self-refinement framework for large language models, significantly improving reasoning accuracy and interpretability by decomposing responses into verifiable sub-steps and iteratively refining unreliable parts.
Contribution
The paper presents Socratic Self-Refine (SSR), a novel approach for detailed evaluation and correction of LLM reasoning, outperforming existing self-refinement methods on multiple benchmarks.
Findings
SSR outperforms state-of-the-art self-refinement baselines.
It provides interpretable reasoning chains.
The method improves accuracy across five reasoning benchmarks and three LLMs.
Abstract
Large Language Models (LLMs) have demonstrated remarkable reasoning abilities, yet existing test-time frameworks often rely on coarse self-verification and self-correction, limiting their effectiveness on complex tasks. In this paper, we propose Socratic Self-Refine (SSR), a novel framework for fine-grained evaluation and precise refinement of LLM reasoning. Our proposed SSR decomposes model responses into verifiable (sub-question, sub-answer) pairs, enabling step-level confidence estimation through controlled re-solving and self-consistency checks. By pinpointing unreliable steps and iteratively refining them, SSR produces more accurate and interpretable reasoning chains. Empirical results across five reasoning benchmarks and three LLMs show that SSR consistently outperforms state-of-the-art iterative self-refinement baselines. Beyond performance gains, SSR provides a principled…
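The step-level check described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `step_confidence`, `find_least_reliable_step`, and `toy_resolve` are hypothetical, the LLM re-solver is replaced by a lookup-table stub, and confidence is assumed here to be the fraction of re-solved answers that exactly match the original sub-answer.

```python
from collections import Counter
from typing import Callable, List, Tuple

# A reasoning step as a (sub_question, sub_answer) pair, as in the abstract.
Step = Tuple[str, str]
# Hypothetical re-solver signature: (sub_question, preceding_steps) -> answer.
Resolver = Callable[[str, List[Step]], str]


def step_confidence(sub_answer: str, resolved: List[str]) -> float:
    """Fraction of re-solved answers that agree with the original sub-answer."""
    if not resolved:
        return 0.0
    matches = sum(a.strip() == sub_answer.strip() for a in resolved)
    return matches / len(resolved)


def find_least_reliable_step(
    steps: List[Step], resolve: Resolver, n_samples: int = 4
) -> Tuple[int, float, str]:
    """Re-solve every sub-question and return (index, confidence, majority answer)
    for the step with the lowest confidence (earliest step wins ties)."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    scored = []
    for i, (question, answer) in enumerate(steps):
        # Re-solve the sub-question given only the steps that precede it.
        samples = [resolve(question, steps[:i]) for _ in range(n_samples)]
        confidence = step_confidence(answer, samples)
        majority = Counter(s.strip() for s in samples).most_common(1)[0][0]
        scored.append((confidence, i, majority))
    confidence, index, majority = min(scored)
    return index, confidence, majority


def toy_resolve(sub_question: str, context: List[Step]) -> str:
    """Stand-in for an LLM call (assumption): answers from a fixed table."""
    table = {"What is 3 * 4?": "12", "What is 12 + 5?": "17"}
    return table[sub_question]


# The second step contains a deliberate arithmetic slip (18 instead of 17).
steps = [("What is 3 * 4?", "12"), ("What is 12 + 5?", "18")]
idx, conf, majority = find_least_reliable_step(steps, toy_resolve, n_samples=3)
print(idx, conf, majority)  # prints: 1 0.0 17
```

In a refinement loop, the flagged step would be replaced by the majority answer (or regenerated) and the check repeated. How SSR actually performs that refinement is not specified in this summary, so the sketch stops at locating the unreliable step.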
Peer Reviews
Decision: Submitted to ICLR 2026
- Well-balanced scientific discourse and method, covering both clarity and rigour.
- Sensible experimental design and empirical analysis.
- Timely topic and good positioning with respect to novelty.
- No proper systematic qualitative analysis reflecting on the methodology. D.4 provides some example outputs.
- Lack of a more descriptive and formal account of the eligibility criteria (inclusion and exclusion) for the baselines, base LLMs, and datasets.
- No need for human annotation.
- No need for fine-tuning or training.
- Breaking the reasoning into small steps and processing those small chunks instead of long paragraphs makes the method more accurate.
- The method section is unnecessarily complex, which makes it hard to understand; the authors could have skipped some of the mathematical notation and described the method verbally instead.
- Some required concepts are not defined: the definition of the Self-Refine method is missing from the paper (especially from the methods section).
- Execution time overhead is not mentioned.
- Considerable increase in execution time: the method first needs to go through the whole reasoning stage, then refine it. If one of the interme
1. Consistent gains across five reasoning tasks and multiple backbones.
2. The paper is overall well written.
1. The fine-grained verification increases compute cost and limits scalability to long chains or large datasets.
2. Step-level decomposition depends on prompting and can be noisy or inconsistent, especially for ambiguous or ill-posed sub-questions.
3. The planning component assumes independence between planning and execution and uses only a single plan check, which may miss plan-level errors.
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
