Self-Questioning Language Models
Lili Chen, Mihir Prabhudesai, Katerina Fragkiadaki, Hao Liu, Deepak Pathak

TL;DR
This paper introduces Self-Questioning Language Models (SQLM), a framework in which a language model is trained with reinforcement learning to generate and solve its own questions, improving its reasoning skills without external data.
Contribution
It proposes an asymmetric self-play framework in which a language model improves its reasoning abilities by generating and solving its own questions, without relying on additional datasets.
Findings
Models improve on benchmarks by generating and solving their own problems.
The framework works across algebra, multiplication, and programming tasks.
Reinforcement learning effectively trains the self-questioning process.
Abstract
Can large language models improve without external data -- by generating their own questions and answers? We hypothesize that a pre-trained language model can improve its reasoning skills given only a single prompt specifying the topic (e.g., algebra word problems) and asking the model to generate its own questions. To do this, we propose Self-Questioning Language Models (SQLM): an asymmetric self-play framework where a proposer is given the topic and generates a question for a solver, who tries to answer it. Both the proposer and solver are trained via reinforcement learning. The proposer receives a reward if the problem is not too easy or too difficult, and the solver receives a reward based on majority voting, a proxy for correctness in the absence of ground-truth answers. For coding, the proposer can instead generate unit tests which are used for verification. We study this…
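The two reward signals described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, answers are assumed to be already-parsed strings from a non-empty list of solver samples, and the "too easy" / "too difficult" rule (all samples agree / no two samples agree) is an assumed stand-in for whatever criterion the paper actually uses.

```python
from collections import Counter


def majority_answer(answers):
    """Return the most common answer and the number of samples that gave it."""
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count


def solver_rewards(answers):
    """Give 1.0 to each sample that matches the majority answer, else 0.0."""
    majority, _ = majority_answer(answers)
    return [1.0 if a == majority else 0.0 for a in answers]


def proposer_reward(answers):
    """Give 1.0 when the question looks neither trivial nor hopeless.

    Assumed rule (not taken from the paper): a question is "too easy" when
    every sample agrees and "too difficult" when no two samples agree.
    """
    _, count = majority_answer(answers)
    too_easy = count == len(answers)
    too_hard = count == 1
    return 0.0 if (too_easy or too_hard) else 1.0


# Four hypothetical solver samples for one proposed question.
samples = ["12", "12", "15", "12"]
print(solver_rewards(samples))   # [1.0, 1.0, 0.0, 1.0]
print(proposer_reward(samples))  # 1.0
```

Note that neither function consults a ground-truth answer: the solver is rewarded for agreeing with its own consensus, which is why the abstract calls majority voting a proxy for correctness.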
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- Timeliness and relevance of the topic: label-free learning loops for supporting reasoning tasks.
- Overall clear narrative.
- Narrow experimental design. Lack of clear selection criteria for the target tasks. Limited baselines: comparisons focus on simple alternatives (e.g., format-reward) rather than strong self-training/self-play methods under matched conditions.
- The contributions are not well positioned with respect to related work.
I agree that the described approach of leveraging LLMs to generate synthetic data for fine-tuning is very promising and successful and should be explored further. The paper describes an implementation of that approach and shows that it works on some simple datasets.
Leveraging LLMs for synthetic data generation to fine-tune on is a well-known and successful approach, and the paper doesn't seem to me to add anything novel to the technique. The paper provides very limited experimental results on very simple datasets. I feel that in 2022 or 2023 this paper would have been considered novel, but by 2025 this approach has been pushed considerably further by many other papers over the last couple of years.
The strengths of this paper lie in its interesting perspective on discriminator-generator training for LLMs in order to "bootstrap" reasoning ability.
S1. While the idea of training neural networks with discriminator-generator min-max style games is not novel, applying this to LLM reasoning as an RL objective is a promising approach.
S2. The method is explained clearly and avoids indulging in unnecessary complexity.
S3. The paper demonstrates statistically significant improvement in math and
While the core idea of asymmetric self-play for LLM training without external data is novel, the paper has a mismatch between its claimed contributions and experimental validation. The experiments are conducted on domains where the proposed approach is least needed, and lack sufficient analysis to demonstrate practical significance.
W1. The paper claims to address settings with scarce training data and difficult verification. However, all experiments use arithmetic, algebra, and LeetCode easy p
- The paper is well-written with clear notations and equations.
- General framework: The proposed approach is conceptually general and, in principle, extendable to various tasks such as reasoning, planning, and code generation.
- Limited experimental scale: The experiments are conducted only on Qwen 2.5 3B instruct models, lacking results for smaller or larger models or for different model families.
- Potentially noisy reward signals: The majority-vote reward may reinforce model self-consistency bias, potentially amplifying incorrect consensus among sampled outputs. This could make the model learn inherent biases that are hard to correct later on.
- It seems that the reward is constant for the proposer in equation 2 as lon
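The self-consistency concern raised in the reviews can be made concrete with a small sketch. It assumes a plain majority-vote reward (1.0 for matching the most common sampled answer, 0.0 otherwise); the function name and the sampled answers are hypothetical and are not taken from the paper.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Return the majority answer and a 0/1 reward per sample for matching it."""
    majority = Counter(answers).most_common(1)[0][0]
    return majority, [1.0 if a == majority else 0.0 for a in answers]


# Hypothetical solver samples for "17 * 24" (the true product is 408).
samples = ["398", "398", "408", "398"]
majority, rewards = majority_vote_rewards(samples)

print(majority)  # 398
print(rewards)   # [1.0, 1.0, 0.0, 1.0]
```

Here the only correct sample receives zero reward while the shared mistake is reinforced, which is the "incorrect consensus" failure mode the reviewer describes.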
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
