Better LLM Reasoning via Dual-Play
Zhengxin Zhang, Chengyu Huang, Aochong Oliver Li, Claire Cardie

TL;DR
PasoDoble is a dual-play framework that adversarially trains two LLMs, a Proposer that generates challenging questions and a Solver that answers them, without external supervision, with the aim of improving reasoning capabilities.
Contribution
The paper presents PasoDoble, a dual-play training method that adversarially trains two models to improve LLM reasoning, adds measures to stabilize that training, and requires no external supervision.
Findings
Improves LLM reasoning performance.
Operates without external supervision.
Enhances training stability with offline updates.
Abstract
Large Language Models (LLMs) have achieved remarkable progress through Reinforcement Learning with Verifiable Rewards (RLVR), yet still rely heavily on external supervision (e.g., curated labels). Adversarial learning, particularly through self-play, offers a promising alternative that enables models to iteratively learn from themselves, thus reducing reliance on external supervision. Dual-play extends adversarial learning by assigning specialized roles to two models and training them against each other, fostering sustained competition and mutual evolution. Despite its promise, the adaptation of dual-play training to LLMs remains limited, largely due to their susceptibility to reward hacking and training instability. In this paper, we introduce PasoDoble, a novel LLM dual-play framework. PasoDoble adversarially trains two models initialized from the same base model: a Proposer, which generates…
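As a rough illustration of the Proposer/Solver loop described above, the following Python sketch plays one dual-play round with stub models. It is not the paper's implementation: the reward rules (binary correctness for the Solver, `1 - accuracy` for the Proposer, and zero Proposer reward when no sample is correct), all function names, and the toy addition task are assumptions made for illustration, since the abstract as excerpted here does not specify them.

```python
import random


def solver_rewards(answers, reference):
    """Assumed Solver reward: 1.0 per sampled answer that matches the reference."""
    return [1.0 if a == reference else 0.0 for a in answers]


def solver_accuracy(answers, reference):
    """Fraction of sampled answers that match the Proposer's reference answer."""
    rewards = solver_rewards(answers, reference)
    return sum(rewards) / len(rewards) if rewards else 0.0


def proposer_reward(accuracy):
    """Assumed Proposer reward: harder questions (lower accuracy) score higher.

    A question that no sample solves gets 0.0, on the assumption that it may
    be invalid rather than merely hard.
    """
    if accuracy == 0.0:
        return 0.0
    return 1.0 - accuracy


def dual_play_round(propose, solve, num_samples=4):
    """Play one round: the Proposer asks, the Solver answers num_samples times."""
    question, reference = propose()
    answers = [solve(question) for _ in range(num_samples)]
    accuracy = solver_accuracy(answers, reference)
    return {
        "question": question,
        "accuracy": accuracy,
        "solver_rewards": solver_rewards(answers, reference),
        "proposer_reward": proposer_reward(accuracy),
    }


def make_toy_models(seed=0, solver_skill=0.5):
    """Hypothetical stand-ins for the two LLMs, using single-digit addition."""
    rng = random.Random(seed)

    def propose():
        a, b = rng.randint(1, 9), rng.randint(1, 9)
        return (a, b), a + b  # question and its reference answer

    def solve(question):
        a, b = question
        # Answer correctly with probability solver_skill, otherwise off by one.
        return a + b if rng.random() < solver_skill else a + b + 1

    return propose, solve


if __name__ == "__main__":
    propose, solve = make_toy_models()
    print(dual_play_round(propose, solve))
```

In an actual training setup the two reward signals would drive policy updates for each model; the sketch stops at computing them.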
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The methodology is explained relatively clearly.
- The in-domain results seem promising, despite lack of out-of-domain generalization.
- This paper explains the main methodology, reward function design and findings clearly. However, the conclusion is on weaker grounds due to insufficient baselines, inadequate methodology validation and result interpretations.
- Missing important baseline: SFT model using Knowledge Base should be a critical baseline to highlight the advantages of this technique. If SFT can achieve a similar level of mathematical reasoning capacity, the value of this technique remains unclear.
- Insufficient vali
- The paper proposes a Dual-Play learning framework that enhances reasoning ability by having two LLMs compete with each other.
- It stabilizes the Proposer’s question generation using a knowledge base and ensures stable adversarial training through a reward design based on correctness and diversity.
- The proposed method appears unfair because it uses a knowledge base, while the baselines do not. I am particularly concerned about how much knowledge or formatting from the evaluation data may have leaked into the knowledge base.
- The paper does not quantitatively show how valid the generated problems were, nor how invalid the discarded problems actually were.
- Training both the Solver and the Proposer roughly doubles the computational cost compared to standard training.
- The idea of improv
- The paper introduces a well-defined training setup that leverages reinforcement learning with verifiable rewards to improve reasoning performance. It achieves strong empirical results: Qwen3-1.7B-Base improves by about 20 points in pass@1 accuracy, despite using limited supervision. The presentation is clear, with consistent terminology and a straightforward description of the training process. The method is evaluated on multiple math benchmarks, demonstrating solid improvements over strong base
- Several average scores reported in Table 1 are incorrect: at least six appear miscalculated (e.g., Qwen3-1.7B Coldstart: 29.55 → 24.63; PasoDoble Offline: 47.51 → 39.59). These are not minor rounding errors, but significant numerical inconsistencies that affect the paper’s main claims. This undermines trust in the evaluation and should be corrected.
- After correcting the scores, Coldstart consistently underperforms the corresponding Base models across all configurations, despite being fine
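The consistency check a reviewer describes for Table 1 (recomputing each row's average from its per-benchmark scores) can be sketched as follows. The helper name, the tolerance, and the example numbers are hypothetical placeholders, not values from the paper, whose per-benchmark scores are not reproduced on this page.

```python
def check_reported_average(scores, reported, tol=0.01):
    """Recompute a row's mean and compare it with the reported average.

    Returns (recomputed_mean_rounded_to_2dp, matches_reported). The tolerance
    allows for averages that were rounded to two decimals.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    recomputed = sum(scores) / len(scores)
    return round(recomputed, 2), abs(recomputed - reported) <= tol


# Placeholder rows: (per-benchmark scores, average as reported in a table).
rows = {
    "consistent row": ([10.0, 20.0, 30.0], 20.0),
    "inconsistent row": ([10.0, 20.0, 30.0], 24.0),
}

for name, (scores, reported) in rows.items():
    mean, ok = check_reported_average(scores, reported)
    status = "ok" if ok else "MISMATCH"
    print(f"{name}: reported={reported} recomputed={mean} {status}")
```

Running such a check over every row before submission would catch discrepancies of the size the reviewer reports.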
Code & Models
- 🤗PasoDoble-Cornell/Qwen3-1.7b-solver-onlinemodel
- 🤗PasoDoble-Cornell/Qwen3-0.6b-solver-onlinemodel
- 🤗PasoDoble-Cornell/Qwen2.5-1.5b-solver-onlinemodel· 1 dl
- 🤗PasoDoble-Cornell/Qwen3-1.7b-solver-offlinemodel· 2 dl
- 🤗PasoDoble-Cornell/Qwen2.5-1.5b-solver-offlinemodel· 1 dl
- 🤗PasoDoble-Cornell/Qwen2.5-0.5b-solver-offlinemodel· 2 dl
- 🤗PasoDoble-Cornell/Qwen3-0.6b-solver-offlinemodel· 1 dl
- 🤗PasoDoble-Cornell/Qwen2.5-0.5b-solver-online-newmodel· 1 dl
- 🤗PasoDoble-Cornell/Qwen2.5-3b-solver-onlinemodel· 3 dl
- 🤗PasoDoble-Cornell/Qwen2.5-3b-solver-offlinemodel· 3 dl
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
