Beyond KL Divergence: Policy Optimization with Flexible Bregman Divergences for LLM Reasoning

Rui Yuan; Mykola Khandoga; Vinay Kumar Sankarapu

arXiv:2602.04380 · cs.LG · February 5, 2026

TL;DR

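This paper introduces Group-Based Mirror Policy Optimization (GBMPO), a framework that replaces the KL regularizer in group-based policy optimization for LLM reasoning with flexible Bregman divergences, either hand-designed (L2 in probability space) or learned as neural mirror maps. It reports gains over the Dr. GRPO baseline on GSM8K and results for neural mirror maps on MBPP.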

Contribution

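The paper proposes GBMPO, a framework that generalizes the policy regularizer in group-based policy optimization from KL divergence to Bregman divergences, covering both hand-designed alternatives (L2 in probability space) and learned neural mirror maps, and evaluates how the choice of divergence affects mathematical reasoning and code generation.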

Findings

01. ProbL2-GRPO achieves 86.7% accuracy on GSM8K, 5.5 points above the Dr. GRPO baseline.

02. Neural mirror maps reach 60.1-60.8% pass@1 on MBPP, with benefits from random initialization.

03. Variance reduction and efficiency gains are observed with meta-learning and neural mirror maps.

Abstract

Policy optimization methods like Group Relative Policy Optimization (GRPO) and its variants have achieved strong results on mathematical reasoning and code generation tasks. Despite extensive exploration of reward processing strategies and training dynamics, all existing group-based methods exclusively use KL divergence for policy regularization, leaving the choice of divergence function unexplored. We introduce Group-Based Mirror Policy Optimization (GBMPO), a framework that extends group-based policy optimization to flexible Bregman divergences, including hand-designed alternatives (L2 in probability space) and learned neural mirror maps. On GSM8K mathematical reasoning, hand-designed ProbL2-GRPO achieves 86.7% accuracy, improving +5.5 points over the Dr. GRPO baseline. On MBPP code generation, neural mirror maps reach 60.1-60.8% pass@1, with random initialization already capturing…
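To make the abstract's central idea concrete, the following is a minimal sketch of the textbook Bregman divergence, D_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>, evaluated on two small probability vectors. It shows that choosing negative entropy as the mirror map phi recovers KL divergence, while choosing half the squared Euclidean norm gives a squared L2 distance in probability space. The function names, the example distributions, and the 1/2 scaling are illustrative assumptions; the paper's exact ProbL2 definition and its neural mirror map parameterization are not given in this listing and may differ.

```python
import math

def bregman(phi, grad_phi, p, q):
    """Generic Bregman divergence: phi(p) - phi(q) - <grad phi(q), p - q>."""
    g = grad_phi(q)
    inner = sum(gi * (pi - qi) for gi, pi, qi in zip(g, p, q))
    return phi(p) - phi(q) - inner

# Mirror map 1: negative entropy. On the probability simplex this yields KL(p || q).
def neg_entropy(p):
    return sum(x * math.log(x) for x in p)

def neg_entropy_grad(p):
    return [math.log(x) + 1.0 for x in p]

# Mirror map 2: half squared norm (assumed scaling). This yields 0.5 * ||p - q||^2.
def half_sq_norm(p):
    return 0.5 * sum(x * x for x in p)

def half_sq_norm_grad(p):
    return list(p)

# Hypothetical next-token distributions for a current policy (p) and a reference policy (q).
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

kl = bregman(neg_entropy, neg_entropy_grad, p, q)
l2 = bregman(half_sq_norm, half_sq_norm_grad, p, q)

# Closed forms for comparison.
kl_direct = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
l2_direct = 0.5 * sum((pi - qi) ** 2 for pi, qi in zip(p, q))

print(f"KL via Bregman: {kl:.6f}  direct: {kl_direct:.6f}")
print(f"L2 via Bregman: {l2:.6f}  direct: {l2_direct:.6f}")
```

Both divergences come out of the same `bregman` function, so swapping the regularizer amounts to swapping the mirror map. A learned mirror map would replace `phi` and `grad_phi` with a trainable convex function, which this sketch does not attempt.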

Taxonomy

Topics: Machine Learning and Data Classification · Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning