Think in Parallel, Answer as One: Logit Averaging for Open-Ended Reasoning
Haonan Wang, Chao Du, Kenji Kawaguchi, Tianyu Pang

TL;DR
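ThinkMerge is a decoding strategy that averages next-token logits across multiple parallel reasoning traces to produce a single answer. It brings the benefits of parallel sampling to open-ended tasks such as code generation and web-based research, and it matches or surpasses majority voting on AIME and GPQA.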
Contribution
The paper introduces ThinkMerge, a training-free, plug-and-play decoding method that enhances open-ended reasoning by averaging next-token logits from parallel traces, so no vote over complete outputs is required.
Findings
Matches or surpasses majority voting on AIME and GPQA.
Improves pass@1 on LiveCodeBench (hard) by +8.28% for DeepCoder-14B-Preview and +7.58% for Qwen3-8B.
Enhances web-based research agents across multiple benchmarks.
Abstract
Majority voting has proven effective for close-ended question answering by aggregating parallel reasoning traces. However, it is not directly applicable to open-ended reasoning, such as code generation and web-based deep research, where a "majority" over complete solutions is ill-defined. We introduce ThinkMerge, a training-free, plug-and-play decoding strategy that runs K parallel reasoning traces and averages their next-token logits at synchronization points to produce a single coherent output. ThinkMerge integrates seamlessly with vLLM/SGLang and remains compatible with standard decoding techniques such as Top-p/Top-k. Empirically, it matches or surpasses majority voting on AIME and GPQA, while delivering consistent gains on open-ended coding tasks: on LiveCodeBench (hard), pass@1 improves by +8.28% for DeepCoder-14B-Preview and +7.58% for Qwen3-8B. Beyond code, we further show that…
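The decoding loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use the vLLM or SGLang APIs. Each trace is modeled as a hypothetical callable (built by the toy helper `make_toy_trace`) that returns next-token logits given the shared answer tokens so far. The sketch also assumes that merging happens at every answer token and that tokens are selected greedily. The names `merge_logits` and `merged_decode` are illustrative.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    shifted = np.asarray(logits, dtype=float) - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def merge_logits(trace_logits):
    """Average next-token logits over K traces: shape (K, V) -> (V,)."""
    return np.mean(np.asarray(trace_logits, dtype=float), axis=0)


def merged_decode(next_logits_fns, max_new_tokens, eos_id):
    """Greedy decoding of one shared answer from K traces.

    Assumption: every trace conditions on its own (hidden) reasoning
    plus the SAME answer tokens emitted so far, so contexts stay in sync.
    """
    answer = []
    for _ in range(max_new_tokens):
        merged = merge_logits([fn(answer) for fn in next_logits_fns])
        token = int(np.argmax(merged))
        answer.append(token)  # the chosen token is fed back to all traces
        if token == eos_id:
            break
    return answer


def make_toy_trace(bias):
    """Hypothetical stand-in for a model conditioned on one reasoning trace."""
    bias = np.asarray(bias, dtype=float)

    def next_logits(answer_so_far):
        logits = bias.copy()
        if len(answer_so_far) >= 2:
            logits[3] += 10.0  # toy rule: favor EOS (id 3) after two tokens
        return logits

    return next_logits


# Three toy traces over a 4-token vocabulary; token 3 is EOS.
traces = [
    make_toy_trace([2.0, 1.0, 0.0, -5.0]),
    make_toy_trace([0.0, 3.0, 0.0, -5.0]),
    make_toy_trace([0.0, 2.5, 1.0, -5.0]),
]
result = merged_decode(traces, max_new_tokens=5, eos_id=3)
first_step_probs = softmax(merge_logits([fn([]) for fn in traces]))
print(result)
```

Because the softmax of averaged logits equals the normalized geometric mean of the per-trace distributions, sampling filters such as Top-p or Top-k could be applied to `first_step_probs` in place of the greedy `argmax` used here.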
Peer Reviews
Decision: ICLR 2026 Poster
Clear implementation of a straightforward approach to improve reasoning model performance. Method delivers a clear improvement. Well-described and easy to follow.
N/A
I think this is a solid paper.
No discussion of closed models here. I’m curious, how well does this close the gap between the open and closed models?
Strengths:
- The proposed method is simple to implement, requires no training, and is practically deployable on popular inference engines.
- The technique addresses a limitation of majority voting in open domains, where the response is free-form text (code, reports, etc.) and a "majority vote" is ill-defined.
- The four ablations seem well motivated and grounded in practical considerations.
Weaknesses:
- My major concern is the limited novelty. The proposed technique is a straightforward application of product-of-experts ensembling (see [1]). In the context of LLMs, there is prior work in this area: [2] applies the exact same technique of token fusion across several prompt variants. They too generate logits by processing each prompt and then average the logits at each autoregressive step. ThinkMerge can be thought of as a sp
- Addresses a real gap: parallel CoT doesn’t work for open-ended outputs; this makes it usable there.
- Token-level logit averaging after a shared delimiter, easy to plug into vLLM/SGLang.
- Evaluates on both closed-ended (AIME, GPQA) and open-ended (LiveCodeBench) tasks, so it’s not overfitted to math-only.
- Keeps contexts synchronized while decoding, which is a nice engineering detail for practicality.
- This looks to be delimiter-dependent. It assumes clean think/answer separation; models that ramble or reflect will hurt alignment.
- On AIME/GPQA it sometimes only matches or even loses to majority voting.
- I’m skeptical about pure logit averaging, since it can dilute a minority-but-correct trace. A simple reweighting or confidence-based scheme might help; at minimum I’d like to see an ablation showing how often correct traces get downvoted by the ensemble.
- The method gives clear gains at sm
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
