Rethinking Optimal Transport in Offline Reinforcement Learning
Arip Asadulaev, Rostislav Korst, Alexander Korotin, Vage Egiazarian, Andrey Filchenkov, Evgeny Burnaev

TL;DR
This paper introduces a novel offline reinforcement learning algorithm based on optimal transport, which stitches together the best behaviors from diverse datasets to improve policy performance in continuous control tasks.
Contribution
It redefines offline RL as an optimal transport problem and proposes a new algorithm to extract policies that map states to distributions of optimal expert actions.
Findings
Demonstrates improved performance over existing methods on D4RL benchmarks.
Effectively extracts policies that focus on the best behaviors in sub-optimal datasets.
Validates the approach on continuous control problems with empirical results.
Abstract
We propose a novel algorithm for offline reinforcement learning using optimal transport. Typically, in offline reinforcement learning, the data is provided by various experts, some of whom can be sub-optimal. To extract an efficient policy, it is necessary to stitch together the best behaviors from the dataset. To address this problem, we rethink offline reinforcement learning as an optimal transport problem. Based on this, we present an algorithm that aims to find a policy that maps states to a partial distribution of the best expert actions for each given state. We evaluate the performance of our algorithm on continuous control problems from the D4RL suite and demonstrate improvements over existing methods.
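The idea of treating policy extraction as a transport problem can be illustrated with a small numerical sketch. The code below evaluates a generic maximin optimal-transport objective of the form E_s[c(s, pi(s)) - f(pi(s))] + E_a[f(a)], which is maximized over a potential f and minimized over a policy pi. It assumes the cost is the negated critic value, c(s, a) = -Q(s, a); that choice, the toy critic, the linear policy, and the quadratic potential are all hypothetical stand-ins and are not taken from the paper. The partial (weighted) variant mentioned in the abstract is not covered.

```python
import numpy as np

def critic(states, actions):
    # Hypothetical critic Q(s, a): largest when the action equals the sum of the state features.
    return -(actions - states.sum(axis=1)) ** 2

def potential(actions, theta):
    # Hypothetical potential f(a) = theta[0] * a + theta[1] * a**2.
    return theta[0] * actions + theta[1] * actions ** 2

def policy(states, weights):
    # Hypothetical deterministic linear policy pi(s) = s @ weights.
    return states @ weights

def maximin_objective(states, data_actions, weights, theta):
    # Generic maximin OT objective with assumed cost c(s, a) = -Q(s, a):
    #   E_s[-Q(s, pi(s)) - f(pi(s))] + E_a[f(a)]
    proposed = policy(states, weights)
    transport_term = np.mean(-critic(states, proposed) - potential(proposed, theta))
    target_term = np.mean(potential(data_actions, theta))
    return transport_term + target_term

rng = np.random.default_rng(0)
states = rng.normal(size=(256, 2))

# Toy mixed-quality dataset: half near-optimal actions, half unrelated noise.
good_actions = states[:128].sum(axis=1) + 0.05 * rng.normal(size=128)
bad_actions = rng.normal(scale=2.0, size=128)
data_actions = np.concatenate([good_actions, bad_actions])

theta = np.zeros(2)  # a zero potential, so only the cost term contributes
value_good_policy = maximin_objective(states, data_actions, np.array([1.0, 1.0]), theta)
value_zero_policy = maximin_objective(states, data_actions, np.array([0.0, 0.0]), theta)
print(value_good_policy, value_zero_policy)
```

In an actual training loop the potential and the policy would be neural networks updated in alternation (gradient ascent on the potential, descent on the policy); the sketch only shows how the objective is assembled and that a policy matching the high-value actions attains a lower cost term.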
Peer Reviews
Decision: NeurIPS 2024 poster
(1) Many methods that use OT rely on the Wasserstein distance, which requires optimizing over a function constrained to be Lipschitz. This is often hard. The authors use a maxmin formulation that does not need the function to be Lipschitz. (2) They treat offline RL as an OT problem rather than using OT as a regularizer.
(1) There should have been comparisons to W-BRAC. (2) The results show that PPL^{CQL} produces some marginal improvement over PPL and PPL^{R} with ReBRAC. (3) From Equation 12, you should not need \beta. But in the experiments you constantly talk about being in conjunction with something or training a \beta. Maybe this wasn't clear or I misunderstood: why do you need to be in conjunction with CQL, ReBRAC, or one-step RL? Why can't you simply train the maxmin objective in Equation 12? The work is
Taxonomy
Topics: Reinforcement Learning in Robotics · Optimization and Search Problems
