Demystifying Linear MDPs and Novel Dynamics Aggregation Framework
Joongkyu Lee, Min-hwan Oh

TL;DR
This paper introduces a new hierarchical reinforcement learning framework with dynamics aggregation for linear MDPs, achieving improved regret bounds and providing the first provably guaranteed HRL algorithm with linear function approximation.
Contribution
The paper proves a lower bound on feature dimension in linear MDPs and proposes a novel dynamics aggregation framework with a provably efficient hierarchical RL algorithm.
Findings
Regret bound of $\tilde{O}(d_\psi^{3/2} H^{3/2} \sqrt{NT})$ for the proposed algorithm.
Condition $d_\psi^3 N \ll d^3$ holds in many real-world hierarchical environments.
First HRL algorithm with linear function approximation that has provable guarantees.
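The role of the condition on $d_\psi^3 N$ can be seen by rewriting the regret bound and comparing it with a flat linear-MDP baseline. The sketch below assumes the baseline is the standard $\tilde{O}(\sqrt{d^3 H^3 T})$ bound of LSVI-UCB (Jin et al., 2020), which this summary does not state explicitly, and it ignores the additional aggregation-gap term raised in the reviews.

```latex
\begin{align*}
\text{Regret (hierarchical)} &= \tilde{O}\!\left(d_\psi^{3/2} H^{3/2} \sqrt{NT}\right)
  = \tilde{O}\!\left(\sqrt{d_\psi^{3} N \cdot H^{3} T}\right), \\
\text{Regret (flat baseline, assumed)} &= \tilde{O}\!\left(\sqrt{d^{3} \cdot H^{3} T}\right), \\
\frac{\text{hierarchical}}{\text{flat}} &\approx \sqrt{\frac{d_\psi^{3} N}{d^{3}}}
  \;\ll\; 1 \quad \text{whenever } d_\psi^{3} N \ll d^{3}.
\end{align*}
```

Under these assumptions, both bounds share the factor $\sqrt{H^3 T}$, so the comparison reduces to $d_\psi^3 N$ versus $d^3$.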
Abstract
In this work, we prove that, in linear MDPs, the feature dimension $d$ is lower bounded by $|S|/|U|$ in order to aptly represent transition probabilities, where $|S|$ is the size of the state space and $|U|$ is the maximum size of directly reachable states. Hence, $d$ can still scale with $|S|$ depending on the direct reachability of the environment. To address this limitation of linear MDPs, we propose a novel structural aggregation framework based on dynamics, named "dynamics aggregation". For this newly proposed framework, we design a provably efficient hierarchical reinforcement learning algorithm in linear function approximation that leverages aggregated sub-structures. Our proposed algorithm exhibits statistical efficiency, achieving a regret of $\tilde{O}(d_\psi^{3/2} H^{3/2} \sqrt{NT})$, where $d_\psi$ represents the feature dimension of aggregated subMDPs and …
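To give a feel for the $|S|/|U|$ lower bound, here is a minimal sketch on a hypothetical noisy gridworld. It assumes "directly reachable states" means the states with nonzero transition probability from a given state-action pair, which may differ from the paper's exact definition; the environment, the slip dynamics, and all function names are illustrative and not taken from the paper.

```python
# Hypothetical noisy gridworld: each action moves in the intended direction
# or slips to one of the two perpendicular directions (an assumed model).
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


def clip_move(state, delta, width, height):
    """Apply a move; moves that would leave the grid keep the agent in place."""
    x, y = state[0] + delta[0], state[1] + delta[1]
    return (x, y) if 0 <= x < width and 0 <= y < height else state


def successor_support(width, height):
    """Map each (state, action) pair to its set of possible next states."""
    support = {}
    for x in range(width):
        for y in range(height):
            for name, (dx, dy) in ACTIONS.items():
                # Intended direction plus the two perpendicular slips.
                deltas = [(dx, dy), (dy, dx), (-dy, -dx)]
                support[((x, y), name)] = {
                    clip_move((x, y), d, width, height) for d in deltas
                }
    return support


def dimension_ratio(support):
    """Return (|S|, |U|, |S|/|U|) for a given successor-support mapping."""
    states = {state for (state, _action) in support}
    num_states = len(states)
    max_reachable = max(len(nxt) for nxt in support.values())
    return num_states, max_reachable, num_states / max_reachable


S, U, ratio = dimension_ratio(successor_support(10, 10))
print(f"|S|={S}, |U|={U}, |S|/|U|={ratio:.1f}")
```

In this toy setting each state-action pair reaches only a handful of states, so the ratio grows linearly with the number of grid cells, which is the scaling the abstract warns about.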
Peer Reviews
Decision: ICLR 2024 poster
1. The paper questions the widely accepted belief about linear MDPs, delivering a comprehensive critique of its fundamentals. 2. The new framework, which fuses state aggregation and equivalence mapping, holds promise for addressing the limitations of linear MDPs, making it a significant contribution to the field. 3. The proposed HRL algorithm not only introduces an innovative approach to RL but is the first of its kind to provide proven guarantees in function approximation. 4. The inclusion
1. While the new algorithm excels in controlled experiments, its scalability and performance in more complex, real-world scenarios are yet to be determined. 2. Although numerical experiments are conducted, the paper mentions several examples in Section 4 but does not include experiments or analysis on these examples.
1. The paper shows that the dimension of the linear representation for the probability transition kernel is lower bounded by |S|/|U|, where |S| is the cardinality of the state space and |U| is the maximum size of directly reachable states. If |U| is not of the order of |S|, the regret would depend on the state cardinality. 2. The paper develops a hierarchical linear MDP algorithm to reduce the state cardinality dependency in the final regret. It leverages the internal structure of the problem with the state
1. The paper makes stronger assumptions than previous linear MDP algorithms. For the sub-structure that is explored by the paper, it assumes that the dynamics aggregation is known and has the desired boundedness in Definition 4. 2. For the final regret proven by the paper, although the regret seems to be improved in terms of the state cardinality theoretically, it also introduces another T-dependent term characterizing the aggregation gap w.r.t. the original probability transition kernel. It's not
I think that (albeit simple and potentially expected) the lower bound on the feature dimension of Linear MDP is an important result for the RL theory community. The algorithm (UC-HRL) seems to be an interesting and novel contribution to hierarchical RL.
The regret bound in Theorem 2 has a linear term which can be made sublinear only if $\epsilon_P$ in Definition 4 is of order $\mathcal{O}(1 / \mathrm{poly}(T))$. However, it is not very clear from the paper how big $\epsilon_P$ can be for common choices of the approximate feature aggregation mappings $\psi$. The requirement that the aggregating functions $\psi$ be known in advance seems rather strong, but it is fairly common in hierarchical RL. The discussion after Theorem 2 that justifies that $d^3_
Taxonomy
Topics: Distributed and Parallel Computing Systems
