General Flexible $f$-divergence for Challenging Offline RL Datasets with Low Stochasticity and Diverse Behavior Policies

Jianxun Wang; Grant C. Forbes; Leonardo Villalobos-Arias; David L. Roberts

arXiv:2602.11087·cs.LG·February 12, 2026

TL;DR

This paper introduces a flexible $f$-divergence framework for offline reinforcement learning, enabling adaptive constraints that improve performance on datasets with low diversity and multiple behavior policies.

Contribution

The paper develops a general LP-based formulation that links $f$-divergence to constraints on the Bellman residual, and proposes a flexible $f$-divergence method intended to improve offline RL performance.
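
For orientation, the standard $f$-divergence-regularized LP used in DICE-style offline RL can be written as below. This is a sketch of that well-known regularized form, not necessarily the more general form the paper derives; the symbols $d$ (learned state-action occupancy), $d^D$ (dataset occupancy), $\mu_0$ (initial state distribution), $\alpha$ (regularization weight), and $e_V$ are notation assumed here rather than taken from the paper.

$$\max_{d \ge 0} \; \sum_{s,a} d(s,a)\, r(s,a) \;-\; \alpha\, D_f(d \,\|\, d^D) \quad \text{s.t.} \quad \sum_{a} d(s,a) = (1-\gamma)\,\mu_0(s) + \gamma \sum_{s',a'} P(s \mid s',a')\, d(s',a') \;\; \forall s$$

$$D_f(d \,\|\, d^D) = \sum_{s,a} d^D(s,a)\, f\!\left(\frac{d(s,a)}{d^D(s,a)}\right), \qquad f^*(y) = \sup_{x}\,\bigl(xy - f(x)\bigr)$$

Taking the Lagrangian with multipliers $V(s)$ and maximizing over $d$ (ignoring the nonnegativity constraint for simplicity) gives the dual $\min_V \,(1-\gamma)\,\mathbb{E}_{\mu_0}[V(s)] + \alpha\,\mathbb{E}_{d^D}\!\left[f^*\!\left(e_V(s,a)/\alpha\right)\right]$ with $e_V(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'}[V(s')] - V(s)$. This is the point where the convex conjugate $f^*$ meets the Bellman residual.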

Findings

1. Improved performance on MuJoCo, Fetch, and AdroitHand environments.

2. The LP formulation correctly models the relationship between divergence and Bellman residuals.

3. Flexible $f$-divergence enhances learning from challenging datasets.

Abstract

Offline RL algorithms aim to improve upon the behavior policy that produced the collected data while constraining the learned policy to stay within the support of the dataset. However, practical offline datasets often contain examples with little diversity or limited exploration of the environment, and often come from multiple behavior policies with diverse expertise levels. Limited exploration can impair the offline RL algorithm's ability to estimate $Q$ or $V$ values, while constraining towards diverse behavior policies can be overly conservative. Such datasets call for a balance between the RL objective and behavior policy constraints. We first identify the connection between $f$-divergence and an optimization constraint on the Bellman residual through a more general linear programming (LP) form for RL and the convex conjugate. Following this, we introduce the general flexible function…
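
As a concrete reminder of what an $f$-divergence measures, the sketch below evaluates the textbook definition $D_f(P \,\|\, Q) = \sum_x q(x)\, f(p(x)/q(x))$ for discrete distributions under three common generator functions. It illustrates the general definition only and is not the paper's flexible $f$-divergence; the function names and the example distributions are hypothetical, and $q(x) > 0$ is assumed everywhere.

```python
import numpy as np


def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) * f(p(x) / q(x)); assumes q(x) > 0 for all x."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(q * f(p / q)))


def kl_generator(t):
    # f(t) = t * log(t), using the convention 0 * log(0) = 0.
    t = np.asarray(t, dtype=float)
    safe_t = np.where(t > 0, t, 1.0)  # log(1) = 0 where t == 0
    return t * np.log(safe_t)


def chi2_generator(t):
    # f(t) = (t - 1)^2 gives the Pearson chi-square divergence.
    return (np.asarray(t, dtype=float) - 1.0) ** 2


def tv_generator(t):
    # f(t) = |t - 1| / 2 gives the total variation distance.
    return 0.5 * np.abs(np.asarray(t, dtype=float) - 1.0)


# Hypothetical example: a learned action distribution vs. a dataset distribution.
learned = [0.5, 0.5]
dataset = [0.25, 0.75]

kl_value = f_divergence(learned, dataset, kl_generator)
chi2_value = f_divergence(learned, dataset, chi2_generator)
tv_value = f_divergence(learned, dataset, tv_generator)
```

Different generators penalize the same deviation from the dataset distribution by different amounts, which is why the choice of $f$ controls how conservative a divergence-based constraint is.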


Taxonomy

Topics: Reinforcement Learning in Robotics · Stochastic Gradient Optimization Techniques · Advanced Bandit Algorithms Research