Fractured Chain-of-Thought Reasoning
Baohao Liao, Hanze Dong, Yuhui Xu, Doyen Sahoo, Christof Monz, Junnan Li, Caiming Xiong

TL;DR
This paper introduces Fractured Sampling, a novel inference strategy that balances reasoning depth and solution generation to improve the efficiency and accuracy of large language models in reasoning tasks.
Contribution
It proposes Fractured Sampling, a unified approach that interpolates between full Chain-of-Thought and solution-only sampling, optimizing inference efficiency.
Findings
Fractured Sampling outperforms traditional methods in accuracy-cost trade-offs.
It achieves steep log-linear scaling gains in Pass@k versus token budget.
The approach effectively allocates computation for scalable reasoning.
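Since the findings above are stated in terms of Pass@k, it may help to recall how that metric is commonly estimated. The following is a minimal sketch, assuming the standard unbiased estimator 1 - C(n-c, k) / C(n, k) for n samples of which c are correct; the evaluation protocol the paper actually uses is not given here, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n sampled answers, c of which are correct.

    Assumed estimator (not confirmed by the source): the probability that
    at least one of k answers drawn without replacement from the n samples
    is correct, i.e. 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Note that computing this quantity requires ground-truth labels to count c, which is why Pass@k is an evaluation metric rather than a way of selecting an answer at inference time.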
Abstract
Inference-time scaling techniques have significantly bolstered the reasoning capabilities of large language models (LLMs) by harnessing additional computational effort at inference without retraining. Similarly, Chain-of-Thought (CoT) prompting and its extension, Long CoT, improve accuracy by generating rich intermediate reasoning trajectories, but these approaches incur substantial token costs that impede their deployment in latency-sensitive settings. In this work, we first show that truncated CoT, which stops reasoning before completion and directly generates the final answer, often matches full CoT sampling while using dramatically fewer tokens. Building on this insight, we introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling along three orthogonal axes: (1) the number of reasoning trajectories, (2) the number…
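The sampling structure described in the abstract can be sketched as three nested loops. This is a toy illustration under stated assumptions, not the authors' implementation: the axis names n (trajectories), H (truncation points per trajectory), and m (answers per truncation point) follow the labels used in the peer reviews, evenly spaced truncation points are an assumption, and `sample_trajectory`, `sample_answer`, and the toy stubs are hypothetical stand-ins for real model calls.

```python
import random
from collections import Counter
from typing import Callable, List


def fractured_sample(
    sample_trajectory: Callable[[random.Random], List[str]],
    sample_answer: Callable[[List[str], random.Random], str],
    n: int,
    H: int,
    m: int,
    seed: int = 0,
) -> List[str]:
    """Collect answers from n trajectories x H truncation points x m samples."""
    rng = random.Random(seed)
    answers: List[str] = []
    for _ in range(n):
        steps = sample_trajectory(rng)
        # Assumption: H evenly spaced cut points, the last one being the
        # full trace. Duplicates collapse when the trace has fewer than H steps.
        cuts = sorted({(len(steps) * h + H - 1) // H for h in range(1, H + 1)})
        for cut in cuts:
            prefix = steps[:cut]
            # Branch m final answers from the same truncated reasoning prefix.
            answers.extend(sample_answer(prefix, rng) for _ in range(m))
    return answers


TOY_STEPS = 8


def toy_trajectory(rng: random.Random) -> List[str]:
    # Hypothetical stand-in for a model generating a reasoning trace.
    return [f"step {i}" for i in range(TOY_STEPS)]


def toy_answer(prefix: List[str], rng: random.Random) -> str:
    # Hypothetical stand-in: longer prefixes yield the answer "42" more often.
    if rng.random() < len(prefix) / TOY_STEPS:
        return "42"
    return str(rng.randint(0, 9))


answers = fractured_sample(toy_trajectory, toy_answer, n=2, H=4, m=3)
majority = Counter(answers).most_common(1)[0][0]
print(len(answers), majority)
```

In this sketch the answers are aggregated by a plain majority vote; a reward-model-based selection could be substituted at that step. With a real model, the prefix would be reused across the m answer samples so that only the short answer continuation is regenerated.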
Peer Reviews
Decision: Submitted to ICLR 2026
The results are nice.
LLMs generate unnecessary padding tokens after reaching correct answers. I feel like this is just rebranding early stopping plus ensemble methods as a "unified framework" with three "orthogonal dimensions", which are actually just self-consistency (n), best-of-n (m), and early stopping (H).
1. Figure 1 shows truncated CoT is better, but Table 1 shows H=16 degrades accuracy versus H=1 when using a PRM (61.4% vs 60.4%). They need to discard the first 11 positions as "noise" to make it work. If int
Novel Method: The idea of sampling intermediate reasoning steps is novel and expands on prior inference-time techniques by introducing a new axis of diversity. Unlike conventional decoding, which samples only complete solutions or final answers, Fractured Sampling explicitly fractures the reasoning process and aggregates partial reasoning outcomes. This unified framework can be seen as a fine-grained Tree-of-Thoughts (ToT) approach in which each branch corresponds to a partial reasoning prefix.
Theoretical Analysis: Th
Complexity and Implementation Overhead: Fractured Sampling introduces additional complexity to the inference process. It requires controlling the generation to stop at multiple intermediate points and branching out multiple final answers from each, which in practice means many forward passes or a more complex decoding procedure. This could be cumbersome to implement, especially in black-box API settings where one cannot easily intervene mid-generation. The authors note that their approach assume
- The paper identifies an under-explored dimension of inference-time sampling in CoT-based LLM reasoning: not only how many reasoning chains or final answers to generate, but where in the reasoning trace sampling should occur.
- The empirical evaluation spans several mathematical and scientific reasoning benchmarks, using models of various scales, showing consistent and interpretable trends.
- The method operates purely at inference time without requiring retraining, making it practical for laten
- Line 128: Pass@k is introduced as one of the sampling schemes. However, Pass@k is an evaluation metric that requires access to ground-truth answers, not a sampling method. Unless I am misunderstanding, please clarify this distinction in the text.
- Line 152: The mathematical notation is overloaded. The use of ε with multiple subscripts (sometimes εᵢ, sometimes εᵢⱼ) and later with a superscript (Line 169) reduces readability. Please standardize the notation to avoid confusion.
- The token-bud
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Philosophy and History of Science
