SSR: Socratic Self-Refine for Large Language Model Reasoning

Haizhou Shi; Ye Liu; Bo Pang; Zeyu Leo Liu; Hao Wang; Silvio Savarese; Caiming Xiong; Yingbo Zhou; Semih Yavuz

arXiv:2511.10621 · cs.CL · November 14, 2025

Open Access · 3 Reviews

TL;DR

SSR introduces a fine-grained, step-level self-refinement framework for large language models, significantly improving reasoning accuracy and interpretability by decomposing responses into verifiable sub-steps and iteratively refining unreliable parts.

Contribution

The paper presents Socratic Self-Refine (SSR), a novel approach for detailed evaluation and correction of LLM reasoning, outperforming existing self-refinement methods on multiple benchmarks.

Findings

1. SSR outperforms state-of-the-art self-refinement baselines.
2. It provides interpretable reasoning chains.
3. The method enhances accuracy across five reasoning benchmarks.

Abstract

Large Language Models (LLMs) have demonstrated remarkable reasoning abilities, yet existing test-time frameworks often rely on coarse self-verification and self-correction, limiting their effectiveness on complex tasks. In this paper, we propose Socratic Self-Refine (SSR), a novel framework for fine-grained evaluation and precise refinement of LLM reasoning. Our proposed SSR decomposes model responses into verifiable (sub-question, sub-answer) pairs, enabling step-level confidence estimation through controlled re-solving and self-consistency checks. By pinpointing unreliable steps and iteratively refining them, SSR produces more accurate and interpretable reasoning chains. Empirical results across five reasoning benchmarks and three LLMs show that SSR consistently outperforms state-of-the-art iterative self-refinement baselines. Beyond performance gains, SSR provides a principled…
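The step-level loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `resolve` callable stands in for an LLM call that re-solves one sub-question given the earlier steps, and the names `step_confidence`, `find_weakest_step`, `refine`, the agreement-ratio confidence, the 0.5 threshold, and the majority-vote repair are all assumptions made for illustration. In particular, the sketch only overwrites the flagged sub-answer and does not regenerate later steps that depend on it.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, str]  # (sub-question, sub-answer)
Resolver = Callable[[str, List[Step]], str]  # hypothetical stand-in for an LLM call


def step_confidence(sub_answer: str, resolved: List[str]) -> float:
    """Fraction of re-solved answers that agree with the original sub-answer."""
    if not resolved:
        return 0.0
    return sum(a == sub_answer for a in resolved) / len(resolved)


def find_weakest_step(steps: List[Step], resolve: Resolver,
                      n_samples: int = 5, threshold: float = 0.5) -> Optional[int]:
    """Return the index of the first step whose confidence is below threshold."""
    for i, (question, answer) in enumerate(steps):
        # Re-solve the sub-question using only the steps that precede it.
        samples = [resolve(question, steps[:i]) for _ in range(n_samples)]
        if step_confidence(answer, samples) < threshold:
            return i
    return None


def refine(steps: List[Step], resolve: Resolver, n_samples: int = 5,
           threshold: float = 0.5, max_rounds: int = 3) -> List[Step]:
    """Repeatedly repair the first unreliable step (simplified majority-vote fix)."""
    steps = list(steps)
    for _ in range(max_rounds):
        i = find_weakest_step(steps, resolve, n_samples, threshold)
        if i is None:
            break  # every step passed the self-consistency check
        question, _ = steps[i]
        samples = [resolve(question, steps[:i]) for _ in range(n_samples)]
        majority = Counter(samples).most_common(1)[0][0]
        steps[i] = (question, majority)
    return steps


def toy_resolve(question: str, context: List[Step]) -> str:
    """Deterministic toy resolver; a real one would query a model."""
    answers = {"What is 3 * 4?": "12", "What is 12 + 5?": "17"}
    return answers[question]


steps = [("What is 3 * 4?", "12"), ("What is 12 + 5?", "18")]  # second step is wrong
refined = refine(steps, toy_resolve)
```

With the toy resolver, the second step is flagged because none of its re-solved answers match the original sub-answer, and it is then replaced by the majority answer. A real resolver would be stochastic, so the confidence values would fall between 0 and 1 rather than at the extremes.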

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 8 · Confidence 4

Strengths

- Well-balanced scientific discourse and method, with both clarity and rigour.
- Sensible experimental design and empirical analysis.
- Timely topic, well positioned with respect to novelty.

Weaknesses

- No proper systematic qualitative analysis of the methodology; D.4 provides only some example outputs.
- No explicit, formal description of the eligibility criteria (inclusion and exclusion) for the baselines, base LLMs, and datasets.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- No need for human annotation.
- No need for fine-tuning or training.
- Breaking the reasoning into small steps and processing those small chunks, instead of long paragraphs, makes the method more accurate.

Weaknesses

- The method section is unnecessarily complex and therefore hard to understand; some of the mathematical notation could have been replaced with a verbal description.
- Some required concepts are not defined: a definition of the Self-Refine method is missing from the paper (especially from the methods section).
- Execution time overhead is not mentioned.
- Considerable increase in execution time: the method first needs to go through the whole reasoning stage, then refine it. If one of the interme

Reviewer 03 · Rating 2 · Confidence 3

Strengths

1. Consistent gains across five reasoning tasks and multiple backbones.
2. The paper is overall well written.

Weaknesses

1. The fine-grained verification increases compute cost and limits scalability to long chains or large datasets.
2. Step-level decomposition depends on prompting and can be noisy or inconsistent, especially for ambiguous or ill-posed sub-questions.
3. The planning component assumes independence between planning and execution and uses only a single plan check, which may miss plan-level errors.

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications