Teaching Large Language Models to Reason with Reinforcement Learning

Alex Havrilla; Yuqing Du; Sharath Chandra Raparthy; Christoforos Nalmpantis; Jane Dwivedi-Yu; Maksym Zhuravinskyi; Eric Hambro; Sainbayar Sukhbaatar; Roberta Raileanu

arXiv:2403.04642·cs.LG·March 8, 2024·2 cites

Open Access

TL;DR

This paper evaluates several reinforcement learning algorithms, including Expert Iteration and PPO, for improving large language model reasoning. It finds that the algorithms perform comparably and offers insights into sample complexity and the limits of exploration.

Contribution

It provides a comparative analysis of RL algorithms for LLM reasoning, highlighting Expert Iteration's effectiveness and examining the exploration challenges that arise during RL training.

Findings

01

Expert Iteration performs best in most cases, although all tested algorithms perform comparably.

02

Expert Iteration's sample complexity is similar to PPO's: at most on the order of 10^6 samples to converge.

03

RL training enhances both pass@96 and maj@1 metrics.
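
The two metrics named in the last finding can be illustrated with a minimal Python sketch. It assumes the usual definitions: pass@n counts a problem as solved if any of n sampled answers is correct, and maj@n counts it as solved if the most frequent of n sampled answers is correct (so maj@1 is plain single-sample accuracy). The function names and the toy answers are hypothetical, and the paper's actual sampling and answer-extraction setup is not described on this page.

```python
from collections import Counter


def pass_at_n(samples, reference):
    """Return True if at least one sampled final answer equals the reference."""
    return any(answer == reference for answer in samples)


def maj_at_n(samples, reference):
    """Return True if the most frequent sampled final answer equals the reference.

    Ties are broken in favour of the answer that was sampled first,
    which is how Counter.most_common orders equal counts.
    """
    if not samples:
        return False
    most_common_answer, _count = Counter(samples).most_common(1)[0]
    return most_common_answer == reference


# Hypothetical final answers sampled for one problem whose reference answer is "15".
toy_samples = ["12", "15", "12", "9"]
solved_by_pass = pass_at_n(toy_samples, "15")  # one sample is correct
solved_by_maj = maj_at_n(toy_samples, "15")    # the majority answer is "12"
```

In this toy case the problem counts as solved under pass@4 but not under maj@4, which shows how the two metrics can diverge.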

Abstract

Reinforcement Learning from Human Feedback (RLHF) has emerged as a dominant approach for aligning LLM outputs with human preferences. Inspired by the success of RLHF, we study the performance of multiple algorithms that learn from feedback (Expert Iteration, Proximal Policy Optimization (PPO), Return-Conditioned RL) on improving LLM reasoning capabilities. We investigate both sparse and dense rewards provided to the LLM both heuristically and via a learned reward model. We additionally start from multiple model sizes and initializations both with and without supervised fine-tuning (SFT) data. Overall, we find all algorithms perform comparably, with Expert Iteration performing best in most cases. Surprisingly, we find the sample complexity of Expert Iteration is similar to that of PPO, requiring at most on the order of $10^{6}$ samples to converge from a…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems

Methods: Entropy Regularization · Supervised Fine-Tuning · Proximal Policy Optimization