Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences

Sweta Karlekar; Carolina Zheng; Magnus Saebo; Nicolas Beltran-Velez; Shuyang Yu; John Bowlan; Michal Kucer; David Blei

arXiv:2602.21585 · cs.LG · February 27, 2026

TL;DR

Duel-Evolve is a reward-free, test-time optimization method for LLM outputs that uses pairwise preferences from the same LLM to guide iterative candidate refinement, achieving significant accuracy improvements without external rewards.

Contribution

The paper introduces Duel-Evolve, an evolutionary algorithm that combines LLM self-preferences with Bayesian modeling to optimize outputs at test time without external rewards or labels.

Findings

1. 20 percentage points higher accuracy on MathBench
2. Over 12 percentage points improvement on LiveCodeBench
3. No need for reward models or ground-truth labels

Abstract

Many applications seek to optimize LLM outputs at test time by iteratively proposing, scoring, and refining candidates over a discrete output space. Existing methods use a calibrated scalar evaluator for the target objective to guide search, but for many tasks such scores are unavailable, too sparse, or unreliable. Pairwise comparisons, by contrast, are often easier to elicit, still provide useful signal on improvement directions, and can be obtained from the LLM itself without external supervision. Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates. Duel-Evolve aggregates these noisy candidate comparisons via a Bayesian Bradley-Terry model, yielding uncertainty-aware estimates of candidate quality. These quality estimates…
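To make the aggregation step concrete, here is a minimal sketch of how noisy pairwise comparisons can be turned into per-candidate quality scores with a Bradley-Terry model. It is an illustration under stated assumptions, not the paper's implementation: it computes only a MAP point estimate under a Gaussian prior rather than the uncertainty-aware posterior the abstract describes, and the function name `fit_bradley_terry`, the prior variance, the learning rate, and the toy duel data are all hypothetical.

```python
import math


def fit_bradley_terry(n_items, duels, prior_var=1.0, lr=0.1, steps=2000):
    """MAP estimate of Bradley-Terry log-strengths.

    Assumes an independent N(0, prior_var) prior on each log-strength
    (an illustrative choice, not taken from the paper).
    `duels` is a list of (winner, loser) index pairs.
    """
    theta = [0.0] * n_items
    for _ in range(steps):
        # Gradient of the Gaussian log-prior: -theta_i / prior_var.
        grad = [-t / prior_var for t in theta]
        for winner, loser in duels:
            # Bradley-Terry: P(winner beats loser) = sigmoid(theta_w - theta_l).
            p = 1.0 / (1.0 + math.exp(-(theta[winner] - theta[loser])))
            # Gradient of log p with respect to the two log-strengths.
            grad[winner] += 1.0 - p
            grad[loser] -= 1.0 - p
        # Plain gradient ascent on the concave log-posterior.
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta


# Hypothetical toy data: three candidates, seven noisy self-preference duels.
duels = [(0, 1), (0, 1), (1, 0), (0, 2), (1, 2), (1, 2), (2, 1)]
scores = fit_bradley_terry(3, duels)
ranking = sorted(range(3), key=lambda i: scores[i], reverse=True)
```

In this toy setup candidate 0 wins most of its duels and candidate 2 loses most of its own, so the fitted scores order the candidates accordingly. A fuller Bayesian treatment would also keep a posterior spread per candidate, which is what would let a search procedure account for uncertainty in the comparisons.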

Taxonomy

Topics: Machine Learning and Data Classification · Mobile Crowdsensing and Crowdsourcing · Stochastic Gradient Optimization Techniques