Fractured Chain-of-Thought Reasoning

Baohao Liao; Hanze Dong; Yuhui Xu; Doyen Sahoo; Christof Monz; Junnan Li; Caiming Xiong

arXiv:2505.12992·cs.LG·June 19, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces Fractured Sampling, a novel inference strategy that balances reasoning depth and solution generation to improve the efficiency and accuracy of large language models in reasoning tasks.

Contribution

It proposes Fractured Sampling, a unified approach that interpolates between full Chain-of-Thought and solution-only sampling, optimizing inference efficiency.

Findings

01. Fractured Sampling outperforms traditional methods in accuracy-cost trade-offs.

02. It achieves steep log-linear scaling gains in Pass@k versus token budget.

03. The approach effectively allocates computation for scalable reasoning.
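Since the findings are stated in terms of Pass@k, a minimal sketch of how that metric is commonly estimated may help. It uses the standard unbiased estimator popularized for code-generation benchmarks, which may or may not be the exact variant the authors use. The function name `pass_at_k` and its arguments are illustrative assumptions, not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n sampled answers, c of which are correct.

    Returns the probability that at least one of k answers drawn
    without replacement from the n samples is correct. This is the
    widely used unbiased estimator; whether the paper uses exactly
    this form is an assumption.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets containing no correct answer;
    # it is 0 when fewer than k incorrect answers exist.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Note that `c` can only be counted against ground-truth answers, so this is an evaluation metric rather than a sampling procedure.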

Abstract

Inference-time scaling techniques have significantly bolstered the reasoning capabilities of large language models (LLMs) by harnessing additional computational effort at inference without retraining. Similarly, Chain-of-Thought (CoT) prompting and its extension, Long CoT, improve accuracy by generating rich intermediate reasoning trajectories, but these approaches incur substantial token costs that impede their deployment in latency-sensitive settings. In this work, we first show that truncated CoT, which stops reasoning before completion and directly generates the final answer, often matches full CoT sampling while using dramatically fewer tokens. Building on this insight, we introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling along three orthogonal axes: (1) the number of reasoning trajectories, (2) the number…
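The abstract excerpt is cut off before all three axes are named, so the following is only a rough sketch under stated assumptions. It borrows the labels n, H, and m from Reviewer 01's summary below (trajectories, truncation points, answers per truncation point). The evenly spaced truncation schedule, the majority-vote aggregation, and the callables `sample_reasoning` and `sample_answer` are all hypothetical placeholders, not the authors' implementation.

```python
from collections import Counter
from typing import Callable, List, Sequence

# Hypothetical model interfaces (assumptions, not from the paper):
#   sample_reasoning(prompt, seed) -> sequence of reasoning steps
#   sample_answer(prompt, reasoning_prefix, seed) -> final answer string
ReasoningSampler = Callable[[str, int], Sequence[str]]
AnswerSampler = Callable[[str, Sequence[str], int], str]


def truncation_points(num_steps: int, H: int) -> List[int]:
    """Return H evenly spaced prefix lengths; the last keeps the full trace.

    Even spacing is an assumption made for this sketch.
    """
    return [(num_steps * h + H - 1) // H for h in range(1, H + 1)]


def fractured_sample(
    prompt: str,
    sample_reasoning: ReasoningSampler,
    sample_answer: AnswerSampler,
    n: int,
    H: int,
    m: int,
) -> List[str]:
    """Collect n * H * m candidate answers from truncated reasoning traces."""
    answers: List[str] = []
    for i in range(n):  # axis 1: independent reasoning trajectories
        steps = sample_reasoning(prompt, i)
        for cut in truncation_points(len(steps), H):  # axis: truncation depth
            prefix = steps[:cut]
            for j in range(m):  # axis: final answers per truncated prefix
                answers.append(sample_answer(prompt, prefix, j))
    return answers


def majority_vote(answers: Sequence[str]) -> str:
    """Pick the most frequent answer (one simple aggregation choice)."""
    return Counter(answers).most_common(1)[0][0]
```

With H = 1 and m = 1 the loop reduces to ordinary full-CoT sampling of n trajectories, which is one way to read the claim that the method interpolates between full CoT and cheaper sampling schemes.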

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 2

Strengths

The results are nice.

Weaknesses

LLMs generate unnecessary padding tokens after reaching correct answers. I feel like this is just rebranding early stopping plus ensemble methods as a "unified framework" with three "orthogonal dimensions", which are actually just self-consistency (n), best-of-n (m), and early stopping (H)?

1. Figure 1 shows truncated CoT is better, but Table 1 shows H=16 degrades accuracy versus H=1 when using a PRM (61.4% vs 60.4%). They need to discard the first 11 positions as "noise" to make it work. If int

Reviewer 02 · Rating 4 · Confidence 3

Strengths

Novel Method: The idea of sampling intermediate reasoning steps is novel and expands on prior inference-time techniques by introducing a new axis of diversity. Unlike conventional decoding, which samples only complete solutions or final answers, Fractured Sampling explicitly fractures the reasoning process, aggregating partial reasoning outcomes. This unified framework can be seen as a fine-grained ToT approach where each branch corresponds to a partial reasoning prefix.

Theoretical Analysis: Th

Weaknesses

Complexity and Implementation Overhead: Fractured Sampling introduces additional complexity to the inference process. It requires controlling the generation to stop at multiple intermediate points and branching out multiple final answers from each, which in practice means many forward passes or a more complex decoding procedure. This could be cumbersome to implement, especially in black-box API settings where one cannot easily intervene mid-generation. The authors note that their approach assume

Reviewer 03 · Rating 2 · Confidence 4

Strengths

- The paper identifies an under-explored dimension of inference-time sampling in CoT-based LLM reasoning: not only how many reasoning chains or final answers to generate, but where in the reasoning trace sampling should occur.
- The empirical evaluation spans several mathematical and scientific reasoning benchmarks, using models of various scales, showing consistent and interpretable trends.
- The method operates purely at inference time without requiring retraining, making it practical for laten

Weaknesses

- Line 128: Pass@k is introduced as one of the sampling schemes. However, Pass@k is an evaluation metric that requires access to ground-truth answers, not a sampling method. Unless I am misunderstanding, please clarify this distinction in the text.
- Line 152: Mathematical notations are overloaded. The use of ε with multiple subscripts (sometimes εᵢ, sometimes εᵢⱼ) and later with a superscript (Line 169) reduces readability. Please standardize these notations to avoid confusion.
- The token-bud

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Philosophy and History of Science