TL;DR
This paper introduces Value Gradient Flow (VGF), a scalable method for behavior-regularized reinforcement learning that casts regularization toward a reference distribution as an optimal transport problem. The authors report that it outperforms prior approaches on offline RL and LLM RL tasks.
Contribution
VGF reformulates behavior-regularized RL as an optimal transport problem solved via gradient flow, eliminating explicit policy parameterization and enabling adaptive scaling.
Findings
VGF achieves state-of-the-art results on offline RL benchmarks.
VGF outperforms existing methods on LLM RL tasks.
VGF provides a scalable and flexible framework for behavior-regularized RL.
Abstract
We study behavior-regularized reinforcement learning (RL), where regularization toward a reference distribution (the dataset in offline RL or the base model in LLM RL finetuning) is essential to prevent value over-optimization caused by erroneous out-of-distribution extrapolation. Existing methods rely either on reparameterized policy gradients, which are difficult to scale to large generative models, or on rejection sampling, which can be overly conservative when attempting to move beyond the behavior support. In this paper, we propose Value Gradient Flow (VGF), a scalable new paradigm for behavior-regularized RL. VGF casts behavior-regularized RL as an optimal transport problem that maps the reference distribution to the value-induced optimal policy distribution. We solve this transport problem via discrete gradient flow, where value gradients guide particles initialized from the…
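The particle picture in the abstract can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it assumes a hand-written quadratic value function with an analytic gradient, a Gaussian stand-in for the reference distribution, and plain gradient ascent as the discrete flow. The names `toy_value`, `toy_value_grad`, `value_gradient_flow`, `step_size`, and `num_steps` are hypothetical and chosen only for illustration.

```python
import numpy as np


def toy_value(actions, target):
    # Hypothetical value function: Q(a) = -0.5 * ||a - target||^2.
    return -0.5 * np.sum((actions - target) ** 2, axis=-1)


def toy_value_grad(actions, target):
    # Analytic gradient of the toy value with respect to the action.
    return target - actions


def value_gradient_flow(particles, grad_fn, step_size=0.1, num_steps=5):
    # Discrete gradient flow (assumed here to be plain gradient ascent):
    # each particle takes num_steps small steps along the value gradient.
    x = np.array(particles, dtype=float, copy=True)
    for _ in range(num_steps):
        x = x + step_size * grad_fn(x)
    return x


# Particles start as samples from a stand-in reference distribution.
rng = np.random.default_rng(0)
reference_actions = rng.normal(loc=0.0, scale=0.5, size=(256, 2))
target = np.array([1.0, -1.0])

flowed_actions = value_gradient_flow(
    reference_actions, lambda a: toy_value_grad(a, target)
)

value_before = toy_value(reference_actions, target).mean()
value_after = toy_value(flowed_actions, target).mean()
```

In this toy setting, `num_steps` and `step_size` together bound how far the particles can move from their reference samples, so a smaller budget keeps them closer to the reference distribution. Whether VGF controls regularization in exactly this way is an assumption of the sketch, since the abstract excerpt is truncated.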