Reinforcement learning fine-tuning of language model for instruction following and math reasoning
Yifu Han, Geo Zhang

TL;DR
This paper explores reinforcement learning fine-tuning methods to enhance a small language model's ability to follow instructions and perform math reasoning, demonstrating improved alignment and accuracy through various techniques.
Contribution
It compares multiple RL fine-tuning approaches on a compact language model, highlighting effective strategies for instruction following and math reasoning.
Findings
RLOO with a DeBERTa reward model achieves the best alignment.
DPO provides consistent and strong results.
Synthetic data and an external verifier improve math reasoning accuracy.
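As a rough illustration of the RLOO (REINFORCE Leave-One-Out) idea named in the findings: each sampled response to a prompt is scored against a baseline equal to the mean reward of the other samples for the same prompt. The sketch below shows only that advantage computation. The function name and the example rewards are hypothetical and are not taken from the paper, whose exact implementation is not described here.

```python
from typing import List


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantages for k sampled responses to one prompt.

    For sample i, the baseline is the mean reward of the other k - 1
    samples, and the advantage is reward_i minus that baseline.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


# Hypothetical reward-model scores for three samples of one prompt.
example_advantages = rloo_advantages([1.0, 0.0, 0.0])
```

In a full training loop these advantages would weight the log-probabilities of the corresponding responses in the policy-gradient loss; that part is omitted here.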
Abstract
This study investigates the effectiveness of reinforcement learning (RL) fine-tuning techniques on a compact language model (Qwen2.5-0.5B Base) for two challenging tasks: instruction following and mathematical reasoning. We compare supervised fine-tuning (SFT), Direct Preference Optimization (DPO) using preference-labeled data, and Reinforce Leave-One-Out (RLOO) with reward models. Our experiments show that RLOO with DeBERTa reward modeling achieves the best alignment, while DPO provides strong and consistent results. For math reasoning tasks, synthetic data augmentation and best-of-N sampling with an external verifier significantly improve accuracy, showing the potential of combining fine-tuning with inference-time tools. This study highlights key trade-offs and practical strategies for training lightweight, task-aligned small-scale language models.
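Best-of-N sampling with an external verifier, as mentioned in the abstract, can be sketched as follows: generate N candidate solutions, score each with the verifier, and keep the highest-scoring one. This is a minimal sketch under stated assumptions. The exact-match verifier, the candidate strings, and all function names are hypothetical stand-ins; the paper's actual verifier and sampling setup are not specified here.

```python
from typing import Callable, List


def best_of_n(candidates: List[str], verifier: Callable[[str], float]) -> str:
    """Return the candidate with the highest verifier score.

    On ties, the earliest candidate wins (behavior of Python's max).
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=verifier)


def make_exact_match_verifier(expected: str) -> Callable[[str], float]:
    """Toy verifier (an assumption): 1.0 if the text after the last '='
    equals the expected answer, else 0.0."""
    def verify(candidate: str) -> float:
        final = candidate.strip().split("=")[-1].strip()
        return 1.0 if final == expected else 0.0
    return verify


# Hypothetical candidates standing in for N sampled model outputs.
samples = ["2 + 2 = 5", "2 + 2 = 4", "2 + 2 = 22"]
chosen = best_of_n(samples, make_exact_match_verifier("4"))
```

In practice the candidates would come from sampling the fine-tuned model N times, and the verifier could be any external checker that scores a solution.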
Taxonomy
Topics: Reinforcement Learning in Robotics · Topic Modeling · Multimodal Machine Learning Applications
