Demystifying Linear MDPs and Novel Dynamics Aggregation Framework

Joongkyu Lee; Min-hwan Oh

arXiv:2410.24089·stat.ML·November 1, 2024

Open Access · 3 Reviews

TL;DR

This paper introduces a new hierarchical reinforcement learning framework with dynamics aggregation for linear MDPs, achieving improved regret bounds and providing the first provably guaranteed HRL algorithm with linear function approximation.

Contribution

The paper proves a lower bound on feature dimension in linear MDPs and proposes a novel dynamics aggregation framework with a provably efficient hierarchical RL algorithm.

Findings

01

Regret bound of $\tilde{O}(d_{\psi}^{3/2} H^{3/2} \sqrt{N T})$ for the proposed algorithm.

02

Condition $d_{\psi}^3 N \ll d^3$ holds in many real-world hierarchical environments.

03

First HRL algorithm with linear function approximation that has provable guarantees.
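
The condition in finding 02 can be read off by comparing the stated regret bound with a baseline of the form $\tilde{O}(d^{3/2} H^{3/2} \sqrt{T})$. That baseline form is an assumption here: it matches LSVI-UCB-style bounds for linear MDPs, but this page does not state which baseline the authors use. Under that assumption, a sketch of the comparison is:

```latex
\[
\frac{d_{\psi}^{3/2} H^{3/2} \sqrt{N T}}{d^{3/2} H^{3/2} \sqrt{T}}
  = \sqrt{\frac{d_{\psi}^{3} N}{d^{3}}}
  \ll 1
\quad\Longleftrightarrow\quad
d_{\psi}^{3} N \ll d^{3}.
\]
```

The $H$ and $T$ factors cancel, so the improvement depends only on how the aggregated feature dimension $d_{\psi}$ and the number of aggregated subMDPs $N$ compare with the original feature dimension $d$.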

Abstract

In this work, we prove that, in linear MDPs, the feature dimension $d$ is lower bounded by $S / U$ in order to aptly represent transition probabilities, where $S$ is the size of the state space and $U$ is the maximum size of directly reachable states. Hence, $d$ can still scale with $S$ depending on the direct reachability of the environment. To address this limitation of linear MDPs, we propose a novel structural aggregation framework based on dynamics, named "dynamics aggregation". For this newly proposed framework, we design a provably efficient hierarchical reinforcement learning algorithm in linear function approximation that leverages aggregated sub-structures. Our proposed algorithm exhibits statistical efficiency, achieving a regret of $\tilde{O}(d_{\psi}^{3/2} H^{3/2} \sqrt{N T})$, where $d_{\psi}$ represents the feature dimension of aggregated subMDPs and $N$ …
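
The $S / U$ lower bound can be illustrated numerically with a rank argument. This is a minimal sketch and not necessarily the paper's proof. It assumes the standard linear-MDP factorization $P(s' \mid s, a) = \phi(s, a)^\top \mu(s')$, under which the transition matrix has rank at most $d$, so any valid $d$ must be at least that rank. The helper names (`min_feature_dim`, `cyclic_chain`, `block_uniform`) and the two toy environments are hypothetical, chosen only for illustration.

```python
import numpy as np


def min_feature_dim(P):
    """Rank of the transition matrix (rows: state-action pairs, cols: next states).

    If P = Phi @ M with Phi of shape (S*A, d), then rank(P) <= d,
    so this rank is a lower bound on any valid feature dimension d.
    """
    return int(np.linalg.matrix_rank(P))


def cyclic_chain(S):
    """Single action; state s moves deterministically to (s + 1) mod S, so U = 1."""
    P = np.zeros((S, S))
    P[np.arange(S), (np.arange(S) + 1) % S] = 1.0
    return P


def block_uniform(S, U):
    """Single action; state s jumps uniformly within its own block of U states."""
    assert S % U == 0, "S must be a multiple of U in this toy construction"
    P = np.zeros((S, S))
    for s in range(S):
        start = (s // U) * U
        P[s, start:start + U] = 1.0 / U
    return P


if __name__ == "__main__":
    # U = 1: the bound S / U equals S, and the rank confirms d >= S.
    print(min_feature_dim(cyclic_chain(8)))
    # U = 3, S = 12: rows within a block coincide, so the rank is S / U = 4.
    print(min_feature_dim(block_uniform(12, 3)))
```

In both toy environments the rank equals $S / U$. This matches the abstract's point that $d$ still scales with $S$ unless each state can directly reach a large fraction of the state space.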

Peer Reviews

Decision·ICLR 2024 poster

Reviewer 01 · Rating 6 · marginally above the acceptance threshold · Confidence 4

Strengths

1. The paper questions the widely accepted belief about linear MDPs, delivering a comprehensive critique of its fundamentals.
2. The new framework, which fuses state aggregation and equivalence mapping, holds promise for addressing the limitations of linear MDPs, making it a significant contribution to the field.
3. The proposed HRL algorithm not only introduces an innovative approach to RL but is the first of its kind to provide proven guarantees in function approximation.
4. The inclusion

Weaknesses

1. While the new algorithm excels in controlled experiments, its scalability and performance in more complex, real-world scenarios are yet to be determined.
2. While numerical experiments are conducted, this paper mentions several examples in section 4 but does not include experiments and analysis in these examples.

Reviewer 02 · Rating 5 · marginally below the acceptance threshold · Confidence 3

Strengths

1. The paper shows that the dimension of the linear representation for the probability transition kernel is lower bounded by |S|/|U|, where |S| is the cardinality of the states and |U| is the maximum size of directly reachable states. If |U| is not of the order of |S|, the regret would depend on the state cardinality.
2. The paper develops a hierarchical linear MDP algorithm to reduce the state cardinality dependency in the final regret. It leverages the internal structure of the problem with the state

Weaknesses

1. The paper makes stronger assumptions than previous linear MDP algorithms. For the sub-structure that is explored by the paper, it assumes that the dynamic aggregation is known and has the desired boundedness in Definition 4.
2. For the final regret proven by the paper, although the regret seems to be improved in terms of the state cardinality theoretically, it also introduces another T-dependent term characterizing the aggregation gap w.r.t. the original probability transition kernel. It's not

Reviewer 03 · Rating 6 · marginally above the acceptance threshold · Confidence 2

Strengths

I think that (albeit simple and potentially expected) the lower bound on the feature dimension of Linear MDP is an important result for the RL theory community. The algorithm (UC-HRL) seems to be an interesting and novel contribution to hierarchical RL.

Weaknesses

The regret bound in Theorem 2 has a linear term which can be made sublinear only if $\epsilon_P$ in Definition 4 is of order $\mathcal{O}(1 / \mathrm{poly}(T))$. However, it is not very clear from the paper how big $\epsilon_P$ can be for common choices of the approximate feature aggregation mappings $\psi$. The fact that the aggregating functions $\psi$ are required to be known in advance seems rather strong but somehow common in hierarchical RL. The discussion after Theorem 2 that justifies that $d^3_

Taxonomy

Topics: Distributed and Parallel Computing Systems