OpenDeepThink: Parallel Reasoning via Bradley-Terry Aggregation

Shang Zhou; Wenhao Chai; Kaiyuan Liu; Huanzhi Mao; Qiuyang Mang; Jingbo Shang

arXiv:2605.15177 · cs.AI · May 19, 2026

TL;DR

OpenDeepThink introduces a population-based framework for LLM reasoning that uses pairwise comparisons and evolutionary mutation to improve answer quality without ground-truth verification.

Contribution

The paper proposes a parallel test-time reasoning method that ranks candidate answers with pairwise Bradley-Terry comparisons and mutates them using the judge's natural-language critiques, improving LLM reasoning without a ground-truth verifier.

Findings

01. Raises Gemini 3.1 Pro's effective Codeforces Elo by 405 points in 8 rounds

02. Transfers across models without retuning

03. Gains are concentrated in verifiable domains

Abstract

Test-time compute scaling is a primary axis for improving LLM reasoning. Existing methods primarily scale depth by extending a single reasoning trace. Scaling breadth by sampling multiple candidates in parallel is straightforward, but introduces a selection bottleneck: choosing the best candidate without a ground-truth verifier, since pointwise LLM judging is noisy and biased. To address this, we introduce OpenDeepThink, a population-based test-time compute framework that selects via pairwise Bradley-Terry comparison. Each generation, the LLM judges random pairs of candidates and aggregates votes via Bradley-Terry into a global ranking; top-ranked candidates are preserved and the top three quarters are mutated using the natural-language critiques produced during comparison; the bottom quarter is discarded. OpenDeepThink raises Gemini 3.1 Pro's effective Codeforces Elo by +405 points in…
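
The generation step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`bradley_terry`, `random_pairs`, `next_generation`), the `judge` and `mutate` callables, the smoothing constant `prior`, and the elite count `n_elite` are all assumptions introduced here. The abstract does not say how the Bradley-Terry model is fitted or how many top candidates are preserved; the sketch uses the standard minorization-maximization (MM) update and leaves `n_elite` to the caller.

```python
import random
from collections import defaultdict


def bradley_terry(n, outcomes, iters=200, prior=0.1):
    """Fit Bradley-Terry strengths from (winner, loser) index pairs.

    Uses the classic MM update p_i = W_i / sum_j n_ij / (p_i + p_j).
    `prior` is an assumed smoothing term: every compared pair receives
    `prior` virtual wins in each direction so strengths stay finite.
    """
    wins = [0.0] * n
    games = defaultdict(float)  # games[(i, j)] with i < j: comparison count
    for winner, loser in outcomes:
        wins[winner] += 1.0
        games[(min(winner, loser), max(winner, loser))] += 1.0
    for i, j in list(games):
        wins[i] += prior
        wins[j] += prior
        games[(i, j)] += 2 * prior

    p = [1.0] * n
    for _ in range(iters):
        denom = [0.0] * n
        for (i, j), g in games.items():
            d = g / (p[i] + p[j])
            denom[i] += d
            denom[j] += d
        # Candidates that were never compared keep their current strength.
        p = [wins[k] / denom[k] if denom[k] > 0 else p[k] for k in range(n)]
        total = sum(p)
        p = [x * n / total for x in p]  # fix the scale (mean strength 1)
    return p


def random_pairs(n, n_pairs, seed=0):
    """Sample random pairs of distinct candidate indices (assumed scheme)."""
    rng = random.Random(seed)
    return [tuple(rng.sample(range(n), 2)) for _ in range(n_pairs)]


def next_generation(population, pairs, judge, mutate, n_elite):
    """One generation: judge pairs, rank, keep elites, mutate the top 3/4.

    judge(a, b) -> (a_wins: bool, critique: str)   # hypothetical interface
    mutate(candidate, critiques) -> new candidate  # hypothetical interface
    """
    n = len(population)
    outcomes, critiques = [], defaultdict(list)
    for i, j in pairs:
        i_wins, critique = judge(population[i], population[j])
        outcomes.append((i, j) if i_wins else (j, i))
        critiques[i].append(critique)
        critiques[j].append(critique)

    strengths = bradley_terry(n, outcomes)
    ranked = sorted(range(n), key=lambda k: strengths[k], reverse=True)

    elites = [population[k] for k in ranked[:n_elite]]  # preserved unchanged
    parents = ranked[: (3 * n) // 4]                    # bottom quarter dropped
    children = [mutate(population[k], critiques[k]) for k in parents]
    return elites + children
```

In this sketch, choosing `n_elite` equal to a quarter of the population keeps the population size constant across generations, which is a convenience assumed here rather than something the abstract states. In practice `judge` and `mutate` would wrap LLM calls; `random_pairs(len(population), n_pairs)` supplies the random pairings.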
