Boosting Accuracy and Efficiency of Budget Forcing in LLMs via Reinforcement Learning for Mathematical Reasoning
Ravindra Aribowo Tarunokusumo, Rafael Fernandes Cunha

TL;DR
This paper introduces a reinforcement learning framework to enhance the accuracy and token efficiency of budget forcing in large language models for mathematical reasoning, especially on smaller models.
Contribution
It presents a novel RL-based approach that improves reasoning performance and reduces token usage, overcoming limitations of supervised fine-tuning on small models.
Findings
Over 40% reduction in token usage compared to the SFT model
Improved accuracy on the GSM8K dataset with limited training samples
RL recovers performance losses caused by long-context training
Abstract
Test-time scaling methods have seen a rapid increase in popularity for their computational efficiency and parameter-independent approach to improving the reasoning performance of Large Language Models. One such method is budget forcing, a decoding intervention strategy that allocates extra compute budget for thinking and elicits the inherent self-correcting behavior of the model. However, this relies on supervised fine-tuning (SFT) on long-context reasoning traces, which causes performance degradation on smaller models due to verbose responses. For this reason, we offer a framework integrating reinforcement learning (RL) to improve token efficiency and boost the performance of a 1.5B model for mathematical reasoning. We demonstrate this using only 1.5K training samples and find that our SFT+RL model performs better on the GSM8K dataset with varying compute budgets. Our main findings…
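To make the decoding intervention concrete, here is a minimal sketch of budget forcing as it is commonly described: if the model tries to stop thinking before a minimum budget is spent, the stop is suppressed and a continuation cue is appended; once a maximum budget is reached, thinking is cut off. The marker `</think>`, the cue `Wait`, the function names, and the token-list interface are all assumptions for illustration and are not taken from the paper.

```python
from typing import Callable, List

END_THINK = "</think>"  # hypothetical end-of-thinking marker (assumption)
CONTINUE_CUE = "Wait"   # hypothetical cue that nudges the model to keep reasoning


def budget_forced_decode(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    min_think: int,
    max_think: int,
) -> List[str]:
    """Return the thinking trace, closed by END_THINK, under a token budget."""
    thinking: List[str] = []
    while len(thinking) < max_think:
        tok = next_token(prompt + thinking)
        if tok == END_THINK:
            if len(thinking) >= min_think:
                break  # minimum budget spent: allow the model to stop
            # Too early: suppress the stop and append the continuation cue.
            thinking.append(CONTINUE_CUE)
        else:
            thinking.append(tok)
    # Either the model stopped on its own or the maximum budget was reached.
    return thinking + [END_THINK]


def make_scripted_model(script: List[str]) -> Callable[[List[str]], str]:
    """Toy stand-in for an LLM: replays a fixed token script, then stops."""
    it = iter(script)

    def next_token(context: List[str]) -> str:
        return next(it, END_THINK)

    return next_token


early_stop = make_scripted_model(["a", "b", END_THINK, "c", END_THINK])
extended = budget_forced_decode(early_stop, ["q"], min_think=3, max_think=10)

rambling = make_scripted_model(["x"] * 10)
truncated = budget_forced_decode(rambling, ["q"], min_think=0, max_think=3)
```

In the first call the scripted model tries to stop after two tokens, so the cue is inserted and decoding continues; in the second, the trace is cut at the maximum budget. A real implementation would operate on token IDs and model logits rather than a scripted callable.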
