A Technical Survey of Reinforcement Learning Techniques for Large Language Models
Saksham Sahai Srivastava, Vaneet Aggarwal

TL;DR
This survey reviews reinforcement learning techniques applied to large language models, with emphasis on methods such as RLHF and DPO, and analyzes their applications, challenges, and future directions for improving alignment and reasoning.
Contribution
It provides a comprehensive technical overview and taxonomy of RL methods for LLMs, highlighting current trends, challenges, and emerging research directions.
Findings
RLHF remains the dominant method for alignment.
Outcome-based RL improves stepwise reasoning.
Challenges include reward hacking and high computational costs.
Abstract
Reinforcement Learning (RL) has emerged as a transformative approach for aligning and enhancing Large Language Models (LLMs), addressing critical challenges in instruction following, ethical alignment, and reasoning capabilities. This survey offers a comprehensive foundation on the integration of RL with language models, highlighting prominent algorithms such as Proximal Policy Optimization (PPO), Q-Learning, and Actor-Critic methods. Additionally, it provides an extensive technical overview of RL techniques specifically tailored for LLMs, including foundational methods like Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF), as well as advanced strategies such as Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO). We systematically analyze their applications across domains, ranging from code generation to tool-augmented reasoning. We…
