Reinforcement learning fine-tuning of language model for instruction following and math reasoning

Yifu Han; Geo Zhang

arXiv:2506.21560 · cs.CL · July 29, 2025

TL;DR

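This paper compares reinforcement learning fine-tuning methods for improving a small language model (Qwen2.5-0.5B Base) on instruction following and math reasoning. RLOO with a DeBERTa reward model gives the best alignment, DPO gives strong and consistent results, and synthetic data plus best-of-N sampling with an external verifier raises math accuracy.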

Contribution

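It compares supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and Reinforce Leave-One-Out (RLOO) on a compact language model (Qwen2.5-0.5B Base), and identifies which strategies work best for instruction following and for math reasoning.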

Findings

01

RLOO with a DeBERTa reward model achieves the best alignment.

02

DPO provides consistent and strong results.

03

Synthetic data augmentation and best-of-N sampling with an external verifier significantly improve math reasoning accuracy.
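
The verifier-based selection in the third finding can be pictured as best-of-N sampling: generate several candidate answers, score each with a verifier, and keep the highest-scoring one. The sketch below is only an illustration under stated assumptions; `best_of_n` and `toy_verifier` are hypothetical names, the candidates are hard-coded strings rather than model samples, and the paper's actual verifier is not described on this page.

```python
from typing import Callable, List


def best_of_n(candidates: List[str], verifier: Callable[[str], float]) -> str:
    """Return the candidate with the highest verifier score.

    Ties are resolved in favor of the earliest candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=verifier)


def toy_verifier(candidate: str) -> float:
    """Hypothetical verifier: 1.0 if 'a + b = c' is arithmetically correct.

    Anything that does not parse in that exact shape scores 0.0.
    """
    try:
        lhs, rhs = candidate.split("=")
        a, b = lhs.split("+")
        return 1.0 if int(a) + int(b) == int(rhs) else 0.0
    except ValueError:
        return 0.0


# Stand-ins for N sampled model outputs (assumed, not from the paper).
samples = ["12 + 30 = 41", "12 + 30 = 42", "the answer is big"]
chosen = best_of_n(samples, toy_verifier)
```

In this toy setup the verifier is exact, so one correct sample among the N is enough for the selection to succeed; with a learned or noisy verifier the gain would depend on how well its scores track correctness.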

Abstract

This study investigates the effectiveness of reinforcement learning (RL) fine-tuning techniques on a compact language model (Qwen2.5-0.5B Base) for two challenging tasks: instruction following and mathematical reasoning. We compare supervised fine-tuning (SFT), Direct Preference Optimization (DPO) using preference-labeled data, and Reinforce Leave-One-Out (RLOO) with reward models. Our experiments show that RLOO with DeBERTa reward modeling achieves the best alignment, while DPO provides strong and consistent results. For math reasoning tasks, synthetic data augmentation and best-of-N sampling with an external verifier significantly improve accuracy, showing the potential of combining fine-tuning with inference-time tools. This study highlights key trade-offs and practical strategies for training lightweight, task-aligned small-scale language models.
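
The leave-one-out baseline that gives RLOO its name can be sketched as follows. For k responses sampled for the same prompt, each response's advantage is its reward minus the mean reward of the other k - 1 responses. This is a minimal illustration of that standard baseline, not the authors' training code; `rloo_advantages` is a hypothetical helper and the reward values are made up.

```python
from typing import List


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantages for k samples drawn for one prompt.

    advantage_i = r_i - mean(r_j for j != i)
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    # (total - r) / (k - 1) is the mean reward of the other k - 1 samples.
    return [r - (total - r) / (k - 1) for r in rewards]


# Made-up reward-model scores for four sampled responses.
advantages = rloo_advantages([1.0, 0.0, 0.0, 1.0])
```

Because each baseline excludes the sample it is applied to, the advantages for a prompt sum to zero, and a response is reinforced only to the extent that it beats its siblings.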

Taxonomy

Topics: Reinforcement Learning in Robotics · Topic Modeling · Multimodal Machine Learning Applications