GPG: Generalized Policy Gradient Theorem for Transformer-based Policies

Hangyu Mao, Guangting Dong, and Zhicheng Dou

arXiv:2512.10365·cs.LG·December 12, 2025

Open Access

TL;DR

This paper introduces the Generalized Policy Gradient (GPG) Theorem for Transformer-based policies, shows that the standard Policy Gradient Theorem and GRPO are special cases of it, and explores its use in training large language models.

Contribution

The paper presents the GPG framework, which generalizes existing policy gradient theorems, and applies it to the training of Transformer-based policies in LLMs.

Findings

01. The standard Policy Gradient Theorem and GRPO are both special cases of GPG.

02. Applying GPG to LLM training offers new insights into efficient policy optimization.

03. The GPG Theorem provides theoretical insight into policy gradients for Transformer-based policies.

Abstract

We present the Generalized Policy Gradient (GPG) Theorem, specifically designed for Transformer-based policies. Notably, we demonstrate that both the standard Policy Gradient Theorem and GRPO emerge as special cases within our GPG framework. Furthermore, we explore its practical applications in training Large Language Models (LLMs), offering new insights into efficient policy optimization.
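
For readers unfamiliar with GRPO, one of the two special cases named in the abstract, the sketch below illustrates its commonly described group-relative advantage: each sampled response's reward is normalized against the mean and standard deviation of its own sampling group. This is background only and does not reproduce the GPG Theorem, which this page does not state. The function name `group_relative_advantages`, the `eps` term, the use of the population standard deviation, and the example rewards are all assumptions made for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (GRPO-style).

    `eps` is an assumed small constant that avoids division by zero
    when every reward in the group is identical.
    """
    mean = statistics.fmean(rewards)
    # Assumption: population standard deviation over the group.
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that score above the group mean receive positive advantages and those below receive negative ones, so no separately learned value baseline is needed in this sketch.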

Taxonomy

Topics: Machine Learning and Data Classification · Reinforcement Learning in Robotics · Machine Learning and Algorithms