Single-Trajectory Distributionally Robust Reinforcement Learning

Zhipeng Liang; Xiaoteng Ma; Jose Blanchet; Jiheng Zhang; Zhengyuan Zhou

arXiv:2301.11721·stat.ML·September 24, 2024·1 cites

Open Access 3 Reviews

TL;DR

This paper introduces a novel model-free distributionally robust reinforcement learning algorithm, DRQ, capable of learning robust policies from a single trajectory with proven convergence and superior performance.

Contribution

It presents the first fully model-free DRRL algorithm that learns from a single trajectory, utilizing a multi-timescale framework and providing convergence guarantees.
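For readers unfamiliar with multi-timescale stochastic approximation, the standard step-size conditions for three coupled iterates can be written as below. The symbols \zeta_1, \zeta_2, \zeta_3 are hypothetical names chosen for illustration, and the exact schedules used in the paper are not stated on this page.

```latex
% Three coupled iterates with step sizes \zeta_1(n), \zeta_2(n), \zeta_3(n) (hypothetical names).
% Each schedule satisfies the usual Robbins-Monro conditions:
\sum_{n} \zeta_i(n) = \infty, \qquad \sum_{n} \zeta_i(n)^2 < \infty, \qquad i = 1, 2, 3,
\\
% and the schedules are separated, so each iterate is asymptotically slower than the previous one:
\lim_{n \to \infty} \frac{\zeta_2(n)}{\zeta_1(n)} = 0, \qquad \lim_{n \to \infty} \frac{\zeta_3(n)}{\zeta_2(n)} = 0 .
```

Under such conditions the iterate driven by \zeta_1 moves fastest and sees the others as nearly fixed, while the iterate driven by \zeta_3 moves slowest and sees the faster ones as already settled. One family that satisfies them is \zeta_i(n) = n^{-a_i} with 1/2 < a_1 < a_2 < a_3 \le 1.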

Findings

01. Demonstrates superior robustness over non-robust methods.
02. Shows improved sample efficiency in experiments.
03. Validates convergence through theoretical analysis.

Abstract

To mitigate the limitation that the classical reinforcement learning (RL) framework heavily relies on identical training and test environments, Distributionally Robust RL (DRRL) has been proposed to enhance performance across a range of environments, possibly including unknown test environments. As a price for robustness gain, DRRL involves optimizing over a set of distributions, which is inherently more challenging than optimizing over a fixed distribution in the non-robust case. Existing DRRL algorithms are either model-based or fail to learn from a single sample trajectory. In this paper, we design the first fully model-free DRRL algorithm, called distributionally robust Q-learning with single trajectory (DRQ). We carefully design a multi-timescale framework to fully utilize each incrementally arriving sample and directly learn the optimal distributionally robust policy without…
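To make "optimizing over a set of distributions" concrete, here is a minimal sketch that computes a worst-case expectation over a KL-divergence ball around a nominal distribution, using the standard dual form sup over beta > 0 of -beta * log E_P[exp(-V / beta)] - beta * rho. The KL ball, the grid search over beta, and the function name `kl_robust_expectation` are assumptions made for illustration; they are not taken from the paper, whose uncertainty sets and solution method may differ.

```python
import numpy as np


def kl_robust_expectation(values, probs, rho, betas=None):
    """Approximate inf over {Q : KL(Q || P) <= rho} of E_Q[V] on a finite support.

    Uses the dual  sup_{beta > 0}  -beta * log E_P[exp(-V / beta)] - beta * rho,
    maximised over a grid of beta values (a hypothetical, illustrative choice).
    Assumes all entries of `probs` are strictly positive and sum to one.
    """
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    if betas is None:
        betas = np.logspace(-3, 3, 2001)
    betas = np.asarray(betas, dtype=float)

    # Shift by the minimum value so every exponent is <= 0 (avoids overflow).
    v_min = values.min()
    shifted = values - v_min

    # log E_P[exp(-(V - v_min) / beta)] for every beta on the grid.
    log_mgf = np.log(np.exp(-shifted[None, :] / betas[:, None]) @ probs)

    # Dual objective, with the shift added back.
    dual = v_min - betas * log_mgf - betas * rho
    return float(dual.max())


if __name__ == "__main__":
    next_state_values = [0.0, 1.0, 2.0]   # hypothetical value estimates
    nominal_probs = [0.2, 0.5, 0.3]       # hypothetical nominal transition probabilities
    for radius in (0.0, 0.1, 0.5):
        print(radius, kl_robust_expectation(next_state_values, nominal_probs, radius))
```

With radius 0 the result approaches the nominal expectation, and it decreases toward the smallest value as the radius grows. The inner log of an expectation is the nonlinearity that makes a one-sample plug-in estimate biased, which is the kind of difficulty a single-trajectory method has to work around.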

Peer Reviews

Decision·ICML 2024 Poster

Reviewer 01 · Rating 6 · marginally above the acceptance threshold · Confidence 2

Strengths

- Distributionally robust reinforcement learning is a relevant and active area of research. While a few methods have already been proposed, the paper is novel in that it considers a model-free setting, without assuming access to a simulator but only to single-trajectory data. I believe this is significantly more practical and makes a good contribution.
- A reasonable number of experiments illustrate the features of the proposed approach, compared to existing DRRL baselines.
- The paper is nice

Weaknesses

- Only asymptotic convergence is proven, with no sample-complexity guarantees. Do the authors have a guess on how these may compare with previous work, e.g. (Panaganti et al. 2022)?
- In Section 4.2, a practical implementation of DRQ is used for the experiments. How is this implemented? It would be nice to discuss such implementation in the main text.

Reviewer 02 · Rating 8 · accept, good paper · Confidence 4

Strengths

1. It targets an interesting problem: designing a model-free algorithm for distributionally robust RL problems.
2. A three-timescale algorithm is proposed that enjoys an asymptotic guarantee and practical sample efficiency.
3. The introduction of the algorithm is clear and easy to follow.

Weaknesses

1. In the experiments in Figure 5, the proposed algorithm DDQR seems to perform very similarly to the existing robust algorithm SR-DQN. It would be helpful to add more discussion about this.
2. As mentioned in the algorithm, the three-timescale structure plays a key role in ensuring convergence. It would therefore be better to state the three learning rates that the practical algorithm uses. An ablation study using different learning rates will give a message about

Reviewer 03 · Rating 6 · marginally above the acceptance threshold · Confidence 5

Strengths

- This paper considers distributionally robust MDPs using f-divergence as the uncertainty set, which is novel.
- The motivations for not using SAA to approximate the robust Bellman equation are clear.
- The reasons for not using the multilevel Monte Carlo method are clear.

Weaknesses

- I guess the paper was written in parallel with the Panaganti 2022 (Robust offline) paper. However, as the Panaganti 2022 paper has been published for over 6 months, it would be better to compare and discuss the different choices of uncertainty sets and problem settings. The authors claim that this paper is the first model-free DR RL paper in the literature, which is not true, as Panaganti 2022 is also model-free. To some extent, their paper considers a harder offline problem while this paper considers

Taxonomy

Topics: Distributed Sensor Networks and Detection Algorithms · Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research

Methods: Test · Q-Learning