Comparative Analysis and Parametric Tuning of PPO, GRPO, and DAPO for LLM Reasoning Enhancement
Yongsheng Lian

TL;DR
This paper systematically compares PPO, GRPO, and DAPO RL algorithms for enhancing reasoning in large language models, providing insights on their performance, stability, and parameter effects through transfer learning and benchmark evaluations.
Contribution
It offers the first controlled transfer-learning evaluation of these RL algorithms on LLM reasoning, along with practical parametric guidance for training.
Findings
RL-trained models outperform base models on reasoning tasks
Increasing group size improves training stability and accuracy
Dynamic Sampling in DAPO does not enhance performance
Abstract
This study presents a systematic comparison of three Reinforcement Learning (RL) algorithms (PPO, GRPO, and DAPO) for improving complex reasoning in large language models (LLMs). Our main contribution is a controlled transfer-learning evaluation: models are first fine-tuned on the specialized Countdown Game and then assessed on a suite of general-purpose reasoning benchmarks. Across all tasks, RL-trained models outperform their corresponding base models, although the degree of improvement differs by benchmark. Our parametric analysis offers practical guidance for RL-based LLM training. Increasing the group size in GRPO and DAPO leads to more stable training dynamics and higher accuracy, while the impact of the KL-penalty coefficient is non-monotonic. Additionally, we find that the Dynamic Sampling (DS) component in DAPO does not improve performance; in fact, the best overall results…
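As a rough illustration of why group size matters, the following minimal sketch shows the group-relative advantage used in GRPO-style methods as it is commonly described: each sampled response's reward is normalized against the mean and standard deviation of its own group. It also shows a Dynamic Sampling style filter, which drops groups whose rewards are all identical. The function names, the `eps` constant, and the binary rewards are illustrative assumptions, not the paper's implementation.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's statistics.

    Assumed formulation: A_i = (r_i - mean(r)) / (std(r) + eps).
    `eps` is a hypothetical stabilizer for zero-variance groups.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


def has_learning_signal(rewards):
    """Dynamic Sampling style check (assumed): keep a group only if
    its rewards are not all identical, since an all-correct or
    all-incorrect group gives every response zero advantage."""
    return max(rewards) != min(rewards)


# A group of four sampled responses with hypothetical binary rewards.
group = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(group)

# A uniform group carries no gradient signal under this normalization.
uniform = [1.0, 1.0, 1.0, 1.0]
uniform_advantages = group_relative_advantages(uniform)
```

With a larger group, the mean and standard deviation are estimated from more samples, which is one plausible reason the baseline becomes less noisy; the sketch does not attempt to reproduce the paper's measurements.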
Taxonomy
Topics: Reinforcement Learning in Robotics · Topic Modeling · Language and cultural evolution
