OpenDeepThink: Parallel Reasoning via Bradley-Terry Aggregation
Shang Zhou, Wenhao Chai, Kaiyuan Liu, Huanzhi Mao, Qiuyang Mang, Jingbo Shang

TL;DR
OpenDeepThink introduces a population-based framework for LLM reasoning that uses pairwise comparisons and evolutionary mutation to improve answer quality without ground-truth verification.
Contribution
The paper proposes a parallel reasoning method that ranks candidate answers with pairwise Bradley-Terry comparisons and refines them through critique-guided mutation, improving LLM reasoning performance across models and domains.
Findings
Raises Gemini 3.1 Pro's effective Codeforces Elo by 405 points in 8 rounds
Transfers across models without retuning
Gains concentrated in verifiable domains
Abstract
Test-time compute scaling is a primary axis for improving LLM reasoning. Existing methods primarily scale depth by extending a single reasoning trace. Scaling breadth by sampling multiple candidates in parallel is straightforward, but introduces a selection bottleneck: choosing the best candidate without a ground-truth verifier, since pointwise LLM judging is noisy and biased. To address this, we introduce OpenDeepThink, a population-based test-time compute framework that selects via pairwise Bradley-Terry comparison. Each generation, the LLM judges random pairs of candidates and aggregates votes via Bradley-Terry into a global ranking; top-ranked candidates are preserved and the top three quarters are mutated using the natural-language critiques produced during comparison; the bottom quarter is discarded. OpenDeepThink raises Gemini 3.1 Pro's effective Codeforces Elo by +405 points in…
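The selection step described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The Bradley-Terry model gives each candidate i a positive strength s_i and sets P(i beats j) = s_i / (s_i + s_j). The function names (`fit_bradley_terry`, `sample_pairs`, `select`), the `prior` pseudo-count, the iteration count, the elite count, and the simulated noisy judge are all hypothetical choices made for the example; the abstract does not specify them.

```python
import random
from collections import defaultdict


def fit_bradley_terry(n, comparisons, iters=200, prior=0.1):
    """Estimate Bradley-Terry strengths from (winner, loser) index pairs.

    Uses the standard minorization-maximization update. `prior` is an
    assumption of this sketch: each candidate gets `prior` virtual wins and
    `prior` virtual losses against a dummy opponent of fixed strength 1,
    which keeps strengths finite for unbeaten or uncompared candidates.
    """
    wins = [prior] * n
    games = defaultdict(int)  # unordered pair (i, j), i < j -> number of comparisons
    for winner, loser in comparisons:
        wins[winner] += 1
        games[(min(winner, loser), max(winner, loser))] += 1

    s = [1.0] * n
    for _ in range(iters):
        # Contribution of the virtual games against the dummy opponent.
        denom = [2 * prior / (s[i] + 1.0) for i in range(n)]
        for (i, j), count in games.items():
            d = count / (s[i] + s[j])
            denom[i] += d
            denom[j] += d
        s = [wins[i] / denom[i] for i in range(n)]
    return s


def sample_pairs(n, k, rng):
    """Draw k random pairs of distinct candidate indices."""
    return [tuple(rng.sample(range(n), 2)) for _ in range(k)]


def select(strengths, n_elite=1):
    """Rank by strength; return (elite indices, mutation-parent indices).

    Parents are the top three quarters; the bottom quarter is discarded.
    How many top-ranked candidates are preserved unchanged (n_elite) is an
    assumption of this sketch.
    """
    order = sorted(range(len(strengths)), key=lambda i: strengths[i], reverse=True)
    keep = len(order) - len(order) // 4
    return order[:n_elite], order[:keep]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical hidden quality per candidate, standing in for the LLM judge.
    quality = [0.9, 0.7, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
    votes = []
    for i, j in sample_pairs(len(quality), 40, rng):
        better, worse = (i, j) if quality[i] > quality[j] else (j, i)
        # The simulated judge picks the better candidate 80% of the time.
        votes.append((better, worse) if rng.random() < 0.8 else (worse, better))
    strengths = fit_bradley_terry(len(quality), votes)
    elites, parents = select(strengths, n_elite=2)
    print("elites:", elites, "parents:", parents)
```

In a full loop, the parents would then be mutated using the critiques produced during comparison, and the resulting candidates would be judged again in the next generation. Aggregating many noisy pairwise votes into one global ranking is what lets the method tolerate individual judging errors.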
