SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data

Wenkai Fang; Shunyu Liu; Yang Zhou; Kongcheng Zhang; Tongya Zheng; Kaixuan Chen; Mingli Song; Dacheng Tao

arXiv:2505.20347 · cs.CL · January 27, 2026

TL;DR

SeRL introduces a self-play reinforcement learning approach that enables large language models to improve reasoning skills with limited data by generating instructions and rewards internally, reducing reliance on external annotations.

Contribution

The paper presents SeRL, a self-play RL framework that combines a self-instruction module and a self-rewarding module to train LLMs when little data is available.

Findings

01 Outperforms existing methods on reasoning benchmarks

02 Achieves results comparable to approaches trained on high-quality data

03 Effective across diverse LLM architectures

Abstract

Recent advances have demonstrated the effectiveness of Reinforcement Learning (RL) in improving the reasoning capabilities of Large Language Models (LLMs). However, existing works inevitably rely on high-quality instructions and verifiable rewards for effective training, both of which are often difficult to obtain in specialized domains. In this paper, we propose Self-play Reinforcement Learning (SeRL) to bootstrap LLM training with limited initial data. Specifically, SeRL comprises two complementary modules: self-instruction and self-rewarding. The former module generates additional instructions based on the available data at each training step, employing robust online filtering strategies to ensure instruction quality, diversity, and difficulty. The latter module introduces a simple yet effective majority-voting mechanism to estimate response rewards for additional instructions,…
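The majority-voting idea named in the abstract can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the authors' implementation: it assumes several responses have already been sampled for one instruction and reduced to final-answer strings, that the most frequent answer serves as a pseudo-label, and that rewards are binary. The function name `majority_vote_rewards` is hypothetical.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Estimate rewards for sampled answers to a single instruction.

    Assumption (not taken from the paper): the most frequent answer is
    used as a pseudo-label, and each answer gets reward 1.0 if it matches
    that pseudo-label and 0.0 otherwise.
    """
    if not answers:
        return None, []
    counts = Counter(answers)
    # most_common(1) returns a one-element list: [(answer, count)]
    pseudo_label, _ = counts.most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


# Five hypothetical final answers extracted from sampled responses.
label, rewards = majority_vote_rewards(["42", "41", "42", "42", "7"])
```

In this sketch the three responses that agree on "42" receive reward 1.0 and the other two receive 0.0, so no ground-truth label is needed to score the batch.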

Code & Models

Repositories

wantbook-book/serl · PyTorch · Official

Videos

SeRL: Self-play Reinforcement Learning for Large Language Models with Limited Data · SlidesLive

Taxonomy

Topics: Speech and dialogue systems · Topic Modeling · Natural Language Processing Techniques