GPG: Generalized Policy Gradient Theorem for Transformer-based Policies
Hangyu Mao, Guangting Dong, and Zhicheng Dou

TL;DR
This paper introduces the Generalized Policy Gradient (GPG) Theorem tailored for Transformer-based policies, unifying existing policy gradient methods and enhancing training efficiency for large language models.
Contribution
The paper presents a new GPG framework that generalizes existing policy gradient theorems and applies it to improve training of Transformer-based policies in LLMs.
Findings
GPG unifies standard Policy Gradient and GRPO as special cases.
Application of GPG improves policy optimization in LLM training.
GPG provides theoretical insight into policy gradients for Transformer-based policies.
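As a rough illustration of one of the two special cases named above, the sketch below computes GRPO-style group-relative advantages and plugs them into a plain score-function surrogate. It is not taken from the paper. The function names, the use of the population standard deviation, and the `eps` constant are assumptions made for this example, and GRPO's clipping and KL terms are omitted.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its own sampled group.

    Assumption: population std and a small eps for numerical stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def surrogate_loss(log_probs, weights):
    """Score-function surrogate: negative mean of weight * log-probability.

    With weights set to returns this is the REINFORCE-style objective;
    with group-relative advantages it is a simplified GRPO-style objective.
    """
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)

# Hypothetical group of four sampled responses to a single prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
log_probs = [-2.0, -1.5, -3.0, -2.5]  # made-up sequence log-probabilities

advantages = group_relative_advantages(rewards)
loss = surrogate_loss(log_probs, advantages)
```

In an actual training loop the log-probabilities would be differentiable tensors from the policy model, so that the gradient of this surrogate weights each response's score function by its advantage.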
Abstract
We present the Generalized Policy Gradient (GPG) Theorem, specifically designed for Transformer-based policies. Notably, we demonstrate that both the standard Policy Gradient Theorem and GRPO emerge as special cases within our GPG framework. Furthermore, we explore its practical applications in training Large Language Models (LLMs), offering new insights into efficient policy optimization.
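For reference, the standard Policy Gradient Theorem mentioned in the abstract can be written as follows, together with the token-level factorization that an autoregressive Transformer policy induces. The notation (finite-horizon undiscounted form, prompt x, response y) is an assumption chosen for this note and is not necessarily the paper's. The GPG statement itself is not reproduced here.

```latex
% Standard policy gradient theorem (finite-horizon, undiscounted form)
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,
      Q^{\pi_\theta}(s_t, a_t)
    \right]

% Autoregressive factorization of a Transformer policy over a response y to prompt x
\log \pi_\theta(y \mid x)
  = \sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t})
```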
Taxonomy
Topics: Machine Learning and Data Classification · Reinforcement Learning in Robotics · Machine Learning and Algorithms
