OpenSIR: Open-Ended Self-Improving Reasoner
Wai-Chung Kwan, Joshua Ong Jun Leang, Pavlos Vougiouklis, Jeff Z. Pan, Marco Valentino, Pasquale Minervini

TL;DR
OpenSIR introduces a self-play framework enabling large language models to autonomously generate and solve novel problems, leading to continuous improvement in mathematical reasoning without external supervision.
Contribution
It presents a novel open-ended self-improving framework that co-evolves teacher and student roles for autonomous mathematical discovery and model enhancement.
Findings
Significant performance improvements on GSM8K and College Math benchmarks.
OpenSIR enables models to progress from basic to advanced mathematics autonomously.
The framework promotes diverse problem generation and adaptive difficulty calibration.
Abstract
Recent advances in large language model (LLM) reasoning through reinforcement learning rely on annotated datasets for verifiable rewards, which may limit models' ability to surpass human-level performance. While self-play offers a promising alternative, existing approaches depend on external verifiers or cannot learn open-endedly. We present Open-Ended Self-Improving Reasoner (OpenSIR), a self-play framework where an LLM learns to generate and solve novel problems by alternating teacher and student roles without external supervision. To generate novel problems, OpenSIR optimises for both difficulty and diversity, rewarding problems that challenge appropriately while exploring distinct concepts, enabling open-ended mathematical discovery. Starting from a single trivial seed problem, OpenSIR substantially improves instruction models: Llama-3.2-3B-Instruct advances from 73.9 to 78.3 on…
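The abstract says the teacher is rewarded for problems that are appropriately difficult and that explore distinct concepts. The sketch below is only one way such a signal could be combined, not the paper's actual reward: the function names, the target solve rate of 0.5, the word-overlap (Jaccard) diversity measure, and the equal weighting are all assumptions made for illustration.

```python
from typing import List


def difficulty_score(solve_rate: float, target: float = 0.5) -> float:
    """Return 1.0 when the student's solve rate equals the target,
    falling linearly to 0.0 at the solve rate farthest from the target.
    The target value is an assumed hyperparameter, not taken from the paper."""
    max_gap = max(target, 1.0 - target)
    return 1.0 - abs(solve_rate - target) / max_gap


def diversity_score(problem: str, pool: List[str]) -> float:
    """Return 1 minus the highest Jaccard word overlap between the new
    problem and any problem already in the pool (an assumed proxy for
    'exploring distinct concepts')."""
    if not pool:
        return 1.0
    words = set(problem.lower().split())
    best_overlap = 0.0
    for other in pool:
        other_words = set(other.lower().split())
        union = words | other_words
        if union:
            best_overlap = max(best_overlap, len(words & other_words) / len(union))
    return 1.0 - best_overlap


def teacher_reward(problem: str, solve_rate: float, pool: List[str],
                   target: float = 0.5, diversity_weight: float = 0.5) -> float:
    """Weighted sum of the two scores; the weighting is hypothetical."""
    difficulty = difficulty_score(solve_rate, target)
    diversity = diversity_score(problem, pool)
    return (1.0 - diversity_weight) * difficulty + diversity_weight * diversity


if __name__ == "__main__":
    pool = ["What is 1+1?"]
    # A duplicate of the seed that the student always solves earns nothing.
    print(teacher_reward("What is 1+1?", solve_rate=1.0, pool=pool))
    # A new problem solved about half the time scores higher.
    print(teacher_reward("Find the roots of x^2 - 5x + 6.", solve_rate=0.5, pool=pool))
```

Under these assumptions, a problem that repeats the pool or that the student always (or never) solves earns little reward, which pushes the teacher towards new problems of intermediate difficulty.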
Peer Reviews
Decision: Submitted to ICLR 2026
- The proposed approach significantly outperforms supervised RL approaches (GRPO) and instruction-tuned baselines across a number of models and benchmarks.
- The proposed approach requires no human-annotated data, reducing cost and reliance on manual labeling.
- Joint optimization of teacher and student creates a self-calibrating cycle, enabling continuous self-generated training at optimal difficulty.
- In Section 4.1, Figure 2, the observed V-shaped difficulty trend is interesting, but the authors should provide evidence of the student model’s performance over training (e.g., accuracy or solve rate) to substantiate the claim that this pattern reflects true self-calibration.
- In Section 2.1, the authors state, “We initialise the problem pool P_0 with a single trivial problem (“What is 1+1?”)”. Given the simplicity of this seed, it is worth discussing whether and how this choice constrains the initial divers
1. The paper tests its framework, OpenSIR, across multiple benchmarks (e.g., GSM8K, College Math) using various backbone LLMs, such as Llama-3.2-3B-Instruct and Gemma-2-2B-Instruct. This demonstrates some generality and effectiveness of the approach across different models and tasks.
2. The paper tackles an important topic: using reinforcement learning (RL) to improve LLM reasoning capabilities. RL is a compelling approach for driving autonomous learning, making this work relevant and interesting
1. The core idea of OpenSIR lacks novelty, appearing more like a combination of popular concepts (self-play, RL, curriculum learning) rather than introducing a new approach. The paper could benefit from showcasing deeper insights or unique contributions that distinguish it from existing methods.
2. The authors do not provide code or other necessary materials, making it difficult for researchers to replicate the results or experiment with the framework. Including well-documented code and resour
1. The paper's primary strength lies in its contribution to open-ended, autonomous learning. By successfully demonstrating that an LLM can bootstrap complex reasoning skills from a single trivial example without human supervision, OpenSIR presents a compelling alternative to data-intensive RLHF methods. This addresses a major bottleneck in scaling LLM capabilities and is a significant step towards more autonomous AI systems.
2. The design of the reward function for the teacher role is very effect
1. The experiments are confined to smaller models (2B-3B parameters). While the results are impressive, the paper shows minimal gains for the stronger Qwen-2.5-3B model. The authors suggest this may be due to benchmark contamination, but it could also indicate that the self-improvement process yields diminishing returns for models that are already highly capable. A discussion on the scalability of this approach to state-of-the-art models (e.g., 8B+) is a notable omission.
2. The self-play loop re
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Multimodal Machine Learning Applications · Topic Modeling
