Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu, John Bowlan, Michal Kucer, David Blei

TL;DR
Duel-Evolve is a reward-free, test-time optimization method for LLM outputs. It uses pairwise preferences from the same LLM to guide iterative candidate refinement, with reported accuracy gains of 20 percentage points on MathBench and over 12 percentage points on LiveCodeBench, without external rewards.
Contribution
The paper introduces Duel-Evolve, an evolutionary algorithm for test-time optimization that replaces external rewards and labels with LLM self-preferences, aggregated through a Bayesian Bradley-Terry model.
Findings
20 percentage points higher accuracy on MathBench
Over 12 percentage points improvement on LiveCodeBench
No need for reward models or ground-truth labels
Abstract
Many applications seek to optimize LLM outputs at test time by iteratively proposing, scoring, and refining candidates over a discrete output space. Existing methods use a calibrated scalar evaluator for the target objective to guide search, but for many tasks such scores are unavailable, too sparse, or unreliable. Pairwise comparisons, by contrast, are often easier to elicit, still provide useful signal on improvement directions, and can be obtained from the LLM itself without external supervision. Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates. Duel-Evolve aggregates these noisy candidate comparisons via a Bayesian Bradley-Terry model, yielding uncertainty-aware estimates of candidate quality. These quality estimates…
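To make the aggregation step concrete, the following is a minimal sketch of how noisy pairwise outcomes could be turned into quality estimates with uncertainty under a Bradley-Terry model. It is an illustration under stated assumptions, not the authors' implementation: the abstract excerpt does not describe the inference procedure, so the Gaussian prior, the gradient-ascent MAP fit, the diagonal Laplace approximation, and all names (`fit_bradley_terry`, `laplace_std`, `prior_var`, `steps`, `lr`) are hypothetical choices made here.

```python
import math


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_bradley_terry(n_items, duels, prior_var=1.0, steps=500, lr=0.1):
    """MAP estimate of latent quality scores under a Bradley-Terry likelihood.

    Assumption (not from the paper): each score has an independent
    N(0, prior_var) prior, and the fit uses plain gradient ascent.
    duels is a list of (winner_index, loser_index) pairs.
    """
    theta = [0.0] * n_items
    for _ in range(steps):
        # Gradient of the Gaussian log-prior.
        grad = [-t / prior_var for t in theta]
        for winner, loser in duels:
            # Bradley-Terry: P(winner beats loser) = sigmoid(theta_w - theta_l).
            p_win = _sigmoid(theta[winner] - theta[loser])
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta


def laplace_std(theta, duels, prior_var=1.0):
    """Rough per-item uncertainty from the diagonal of the negative Hessian.

    This is a crude diagonal Laplace approximation (an assumption here);
    it ignores correlations between items.
    """
    precision = [1.0 / prior_var] * len(theta)
    for winner, loser in duels:
        p_win = _sigmoid(theta[winner] - theta[loser])
        precision[winner] += p_win * (1.0 - p_win)
        precision[loser] += p_win * (1.0 - p_win)
    return [1.0 / math.sqrt(p) for p in precision]


# Hypothetical duel outcomes among three candidates: (winner, loser).
duels = [(0, 1), (0, 2), (1, 2), (0, 1)]
theta = fit_bradley_terry(3, duels)
std = laplace_std(theta, duels)
for i, (t, s) in enumerate(zip(theta, std)):
    print(f"candidate {i}: score={t:.3f}, approx std={s:.3f}")
```

In this toy run, candidate 0 wins every duel it takes part in and receives the highest score, while the prior keeps the scores finite even though candidate 0 never loses. The fixed step size is only suitable for small examples like this one; a larger set of duels would call for a smaller `lr` or a proper optimizer.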
Taxonomy
Topics: Machine Learning and Data Classification · Mobile Crowdsensing and Crowdsourcing · Stochastic Gradient Optimization Techniques
