TL;DR
This paper introduces SePT, a self-training method that lets language models improve their reasoning using only their own sampled responses, with no external rewards. Gains are reported across six math reasoning benchmarks.
Contribution
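The paper presents SePT (Self-evolving Post-Training), a post-training approach that alternates between self-generation and training on the self-generated responses. It relies on an online data refresh mechanism, in which each new batch is generated by the most recently updated model.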
Findings
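SePT improves over a strong no-training baseline (the untuned base model at its best swept decoding temperature) across six math reasoning benchmarks, on several tested models.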
Online data refresh and temperature dynamics are crucial for success.
Self-training alone can improve model reasoning without external rewards.
Abstract
Can language models improve their reasoning performance without external rewards, using only their own sampled responses for training? We show that they can. We propose Self-evolving Post-Training (SePT), a simple post-training method that alternates between self-generation and training on self-generated responses. It repeatedly samples questions, uses the model itself to generate responses under a specified sampling temperature, and then trains the model on the self-generated data. In this self-training loop, we use an online data refresh mechanism, where each new batch is generated by the most recently updated model. Across six math reasoning benchmarks, SePT improves a strong no-training baseline, defined as the untuned base model evaluated at its best swept decoding temperature, on several tested models. Additional ablations demonstrate the importance of online data refresh and…
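The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `sept_loop`, `generate`, `train_step`, and the toy stand-ins are hypothetical names, and the abstract does not specify the training objective, batch size, or number of rounds. The point it shows is the online data refresh: every batch is sampled from the model returned by the previous update.

```python
import random
from typing import Any, Callable, List, Tuple

Sample = Tuple[str, str]  # (question, self-generated response)


def sept_loop(
    model: Any,
    questions: List[str],
    generate: Callable[[Any, str, float], str],     # hypothetical sampler
    train_step: Callable[[Any, List[Sample]], Any],  # hypothetical update
    num_rounds: int,
    batch_size: int,
    temperature: float,
    seed: int = 0,
) -> Any:
    """Alternate between self-generation and training on the generated data."""
    rng = random.Random(seed)
    for _ in range(num_rounds):
        # 1. Sample a batch of questions.
        batch_questions = rng.sample(questions, batch_size)
        # 2. Online data refresh: responses come from the *current* model,
        #    i.e. the one produced by the most recent update.
        batch = [(q, generate(model, q, temperature)) for q in batch_questions]
        # 3. Train on the self-generated batch; no external reward is used.
        model = train_step(model, batch)
    return model


# Toy stand-ins (assumptions for illustration only): the "model" is just a
# version counter, so we can see which version produced each batch.
def toy_generate(model: int, question: str, temperature: float) -> str:
    return f"v{model}|T={temperature}|{question}"


seen: List[Tuple[int, List[Sample]]] = []


def toy_train_step(model: int, batch: List[Sample]) -> int:
    seen.append((model, batch))
    return model + 1


final_model = sept_loop(
    model=0,
    questions=["q1", "q2", "q3", "q4"],
    generate=toy_generate,
    train_step=toy_train_step,
    num_rounds=3,
    batch_size=2,
    temperature=0.7,
)
```

In a real setting `generate` would wrap a language model's sampling call and `train_step` would run a fine-tuning update on the sampled responses. An offline variant would instead generate all data once from the initial model, which is the contrast the online refresh ablation refers to.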
