Better LLM Reasoning via Dual-Play

Zhengxin Zhang; Chengyu Huang; Aochong Oliver Li; Claire Cardie

arXiv:2511.11881·cs.LG·January 19, 2026

TL;DR

PasoDoble is a dual-play framework that trains two LLMs adversarially, one generating challenging questions and the other solving them, without external supervision, with the goal of improving reasoning capabilities.

Contribution

The paper presents PasoDoble, a novel dual-play training method for LLMs that enhances reasoning by adversarially training two models with stability improvements and no external supervision.

Findings

01. Improves LLM reasoning performance.
02. Operates without external supervision.
03. Enhances training stability with offline updates.

Abstract

Large Language Models (LLMs) have achieved remarkable progress through Reinforcement Learning with Verifiable Rewards (RLVR), yet still rely heavily on external supervision (e.g., curated labels). Adversarial learning, particularly through self-play, offers a promising alternative that enables models to iteratively learn from themselves - thus reducing reliance on external supervision. Dual-play extends adversarial learning by assigning specialized roles to two models and training them against each other, fostering sustained competition and mutual evolution. Despite its promise, adapting dual-play training to LLMs remains limited, largely due to their susceptibility to reward hacking and training instability. In this paper, we introduce PasoDoble, a novel LLM dual-play framework. PasoDoble adversarially trains two models initialized from the same base model: a Proposer, which generates…
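
The adversarial roles described above can be illustrated with a toy reward sketch. Everything in it is an assumption made for illustration: the function names, the exact-match correctness check, and the `1 - accuracy` difficulty reward are hypothetical and are not taken from the paper, whose actual reward design is not given in this excerpt.

```python
def solver_rewards(answers, reference):
    """Hypothetical Solver reward: 1.0 per sampled answer that exactly
    matches the Proposer's reference answer, else 0.0."""
    return [1.0 if a == reference else 0.0 for a in answers]


def proposer_reward(answers, reference):
    """Hypothetical Proposer reward: higher when the Solver fails more often.

    Assumption: a question that no sample answers correctly earns 0.0,
    since it may be invalid rather than merely hard.
    """
    accuracy = sum(solver_rewards(answers, reference)) / len(answers)
    if accuracy == 0.0:
        return 0.0
    return 1.0 - accuracy


# Four sampled Solver answers to one Proposer question with reference "4".
samples = ["4", "4", "5", "4"]
solver_scores = solver_rewards(samples, "4")
proposer_score = proposer_reward(samples, "4")
```

The point of the sketch is only the opposing incentives: the Solver gains from answering correctly, while the Proposer gains from questions the Solver often, but not always, gets wrong.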

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

- The methodology is explained relatively clearly.
- The in-domain results seem promising, despite lack of out-of-domain generalization.

Weaknesses

- This paper explains the main methodology, reward function design and findings clearly. However, the conclusion is on weaker grounds due to insufficient baselines, inadequate methodology validation and result interpretations.
- Missing important baseline: SFT model using Knowledge Base should be a critical baseline to highlight the advantages of this technique. If SFT can achieve a similar level of mathematical reasoning capacity, the value of this technique remains unclear.
- Insufficient vali

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- The paper proposes a Dual-Play learning framework that enhances reasoning ability by having two LLMs compete with each other.
- It stabilizes the Proposer’s question generation using a knowledge base and ensures stable adversarial training through a reward design based on correctness and diversity.

Weaknesses

- The proposed method appears unfair because it uses a knowledge base, while the baselines do not. I am particularly concerned about how much knowledge or formatting from the evaluation data may have leaked into the knowledge base.
- The paper does not quantitatively show how valid the generated problems were, nor how invalid the discarded problems actually were.
- Training both the Solver and the Proposer roughly doubles the computational cost compared to standard training.
- The idea of improv

Reviewer 03 · Rating 4 · Confidence 4

Strengths

The paper introduces a well-defined training setup that leverages reinforcement learning with verifiable rewards to improve reasoning performance. It achieves strong empirical results: Qwen3-1.7B-Base improves by about 20 points in pass@1 accuracy, despite using limited supervision. The presentation is clear, with consistent terminology and a straightforward description of the training process. The method is evaluated on multiple math benchmarks, demonstrating solid improvements over strong base

Weaknesses

- Several average scores reported in Table 1 are incorrect: at least six appear miscalculated (e.g., Qwen3-1.7B Coldstart: 29.55 → 24.63; PasoDoble Offline: 47.51 → 39.59). These are not minor rounding errors, but significant numerical inconsistencies that affect the paper’s main claims. This undermines trust in the evaluation and should be corrected.
- After correcting the scores, Coldstart consistently underperforms the corresponding Base models across all configurations, despite being fine
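
The two corrections quoted by Reviewer 03 can be checked for a common pattern with a short calculation. The sketch below uses only the four numbers given in the review; the variable names are illustrative. Both corrected-to-reported ratios come out close to 5/6, which would be consistent with a sum having been divided by 5 where 6 was intended. That interpretation is an assumption, since neither the review excerpt nor the paper summary states the cause.

```python
# (reported average, reviewer-corrected average), as quoted in the review.
pairs = {
    "Qwen3-1.7B Coldstart": (29.55, 24.63),
    "PasoDoble Offline": (47.51, 39.59),
}

# Ratio of the corrected value to the reported value for each entry.
ratios = {name: corrected / reported for name, (reported, corrected) in pairs.items()}

# Assumption under test: both ratios sit near 5/6 (about 0.8333).
near_five_sixths = all(abs(r - 5 / 6) < 1e-3 for r in ratios.values())

for name, r in ratios.items():
    print(f"{name}: corrected/reported = {r:.4f}")
print("Both ratios within 0.001 of 5/6:", near_five_sixths)
```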

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques