The Sequential Edge: Inverse-Entropy Voting Beats Parallel Self-Consistency at Matched Compute
Aman Sharma, Paras Chopra

TL;DR
This paper demonstrates that sequential reasoning with inverse-entropy voting significantly outperforms parallel self-consistency in language model inference, suggesting a paradigm shift in test-time scaling strategies.
Contribution
It introduces inverse-entropy weighted voting and establishes sequential refinement as the superior inference method over parallel self-consistency.
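The difference between the two strategies can be sketched as control flow. This is only an illustration, not the authors' implementation: `generate` is a hypothetical callable standing in for any model call, the prompt wording is invented, and matching the number of calls is a simplification of the paper's matched token budget.

```python
from typing import Callable, List

# Hypothetical interface: takes a prompt string, returns the model's text.
Generate = Callable[[str], str]


def parallel_chains(generate: Generate, question: str, n: int) -> List[str]:
    """Run n independent attempts; no attempt sees any other."""
    return [generate(question) for _ in range(n)]


def sequential_chains(generate: Generate, question: str, n: int) -> List[str]:
    """Run n attempts in series; each prompt includes all earlier attempts."""
    attempts: List[str] = []
    for _ in range(n):
        prompt = question
        if attempts:
            history = "\n".join(
                f"Attempt {i + 1}: {a}" for i, a in enumerate(attempts)
            )
            # Assumed refinement prompt; the paper's actual wording is not given here.
            prompt = (
                f"{question}\n\nPrevious attempts:\n{history}\n\n"
                "Refine the answer."
            )
        attempts.append(generate(prompt))
    return attempts
```

Both functions make the same number of model calls. The sequential variant pays for this with longer prompts and serial latency, which is the wall-clock overhead one of the reviews below raises.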
Findings
Sequential scaling outperforms parallel self-consistency in 95.6% of configurations.
Inverse-entropy voting boosts accuracy further over majority voting.
Sequential reasoning achieves up to 46.7% accuracy gains.
Abstract
We revisit test-time scaling for language model reasoning and ask a fundamental question: at equal token budget and compute, is it better to run multiple independent chains in parallel, or to run fewer chains that iteratively refine through sequential steps? Through comprehensive evaluation across 5 state-of-the-art open-source models and 3 challenging reasoning benchmarks, we find that sequential scaling, where chains explicitly build upon previous attempts, consistently outperforms the dominant parallel self-consistency paradigm in 95.6% of configurations, with accuracy gains of up to 46.7%. Further, we introduce inverse-entropy weighted voting, a novel training-free method to further boost the accuracy of sequential scaling. By weighting answers in proportion to the inverse entropy of their reasoning chains, we increase our success rate over parallel majority and establish it as the…
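The voting rule described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact formulation: it assumes each chain comes with per-token probability distributions (for example, renormalized top-k logprobs), scores a chain by its mean token entropy, and adds a small `eps` to avoid division by zero. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def mean_token_entropy(token_distributions):
    """Average Shannon entropy (in nats) over per-token distributions."""
    entropies = [
        -sum(p * math.log(p) for p in probs if p > 0)
        for probs in token_distributions
    ]
    return sum(entropies) / len(entropies)


def majority_vote(chains):
    """Baseline: every chain's answer counts equally."""
    return Counter(answer for answer, _ in chains).most_common(1)[0][0]


def inverse_entropy_vote(chains, eps=1e-6):
    """Weight each chain's answer by 1 / (mean token entropy + eps).

    `chains` is a list of (answer, token_distributions) pairs.
    `eps` is an assumed smoothing constant, not taken from the paper.
    """
    scores = defaultdict(float)
    for answer, dists in chains:
        scores[answer] += 1.0 / (mean_token_entropy(dists) + eps)
    return max(scores, key=scores.get)


# Two uncertain chains agree on "A"; one confident chain says "B".
chains = [
    ("A", [[0.5, 0.5]] * 3),
    ("A", [[0.5, 0.5]] * 3),
    ("B", [[0.99, 0.01]] * 3),
]
print(majority_vote(chains), inverse_entropy_vote(chains))
```

In this toy input the majority picks "A", while the inverse-entropy rule picks "B" because the single low-entropy chain outweighs the two high-entropy ones. Whether low entropy actually tracks correctness is the assumption one of the reviews below questions.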
Peer Reviews
Decision: Submitted to ICLR 2026
1. Claim clearly presented with a wide variety of evidence. The authors claim that sequential self-refinement beats parallel self-consistency at matched token budgets. The claim is supported by results across 5 models, 3 benchmarks, and multiple chain counts (3/6/9), which indeed show higher accuracy in almost all settings. The wide range of configurations supports the generalizability of the claim.
2. Training-free and cross-model. The authors avoid additional fine-tuning and sho
1. The hypothesis that token-level entropy, as a proxy for model confidence, is a valid metric for weighting chain quality is not verified. The authors propose using token-wise entropy as a weighting factor for generated chains. The critical assumption is that model confidence is positively correlated with the quality or correctness of the response. However, it is a common phenomenon that a model confidently generates the wrong answer for certain prompts. The test to verify the effectiveness of token level entro
* The paper offers near-universal evidence (95.6% win rate) that sequential reasoning outperforms the parallel method (Self-Consistency) across diverse LLMs and complex reasoning tasks.
* The technical contribution of Inverse-Entropy Weighted (IEW) Voting is elegant and training-free, providing a principled way to leverage the LLM's inherent uncertainty (via logprobs) to aggregate results.
* The paper is in an important area, and we definitely need more analysis and interesting studies about de
* The paper acknowledges that sequential, serial execution has a substantial wall-clock time overhead compared to parallel methods, making it challenging for real-time applications.
* The core advantage is hypothesized to come from Error Correction and Context Accumulation, but the experiments do not empirically decouple and quantify the contribution of these two distinct mechanisms.
* The Creative Tasks ablation shows a divergent trade-off (Sequential: high lexical diversity; Parallel: high sem
1. This paper challenges a widely accepted inference-time scaling orthodoxy (parallel self-consistency) with compelling evidence favoring sequential reasoning.
2. Controlled matched-compute setup and multi-model, multi-domain evaluation ensure fairness and reproducibility.
3. Inverse-entropy voting introduces a principled, information-theoretic mechanism that improves upon heuristic majority voting.
4. This paper demonstrates generality across reasoning, scientific, and creative tasks, reinfo
1. More related work should be discussed, e.g. https://aclanthology.org/2024.findings-emnlp.135.pdf, https://arxiv.org/abs/2401.02009, https://arxiv.org/abs/2308.00436. For example, at the same cost, does the proposed method perform better than mirror-consistency, self-contrast, and self-check?
2. The main benchmarks (AIME, GPQA) focus on mathematical and scientific reasoning; including commonsense or real-world tasks (e.g., MMLU, GSM8K) would further support generality.
3. Self-refinement is
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
