Learning a Fast Mixing Exogenous Block MDP using a Single Trajectory
Alexander Levine, Peter Stone, Amy Zhang

TL;DR
This paper introduces STEEL, an algorithm that learns the controllable dynamics of an exogenous block MDP (Ex-BMDP) from a single trajectory, with sample complexity that depends on the size of the controllable latent space and the mixing time of the exogenous noise.
Contribution
STEEL is the first provably sample-efficient method for learning exogenous block MDP dynamics from a single trajectory in the function approximation setting.
Findings
STEEL achieves sample complexity depending only on controllable space size and mixing time.
STEEL is proven to be correct and sample-efficient.
STEEL's effectiveness is demonstrated on toy problems.
Abstract
In order to train agents that can quickly adapt to new objectives or reward functions, efficient unsupervised representation learning in sequential decision-making environments can be important. Frameworks such as the Exogenous Block Markov Decision Process (Ex-BMDP) have been proposed to formalize this representation-learning problem (Efroni et al., 2022b). In the Ex-BMDP framework, the agent's high-dimensional observations of the environment have two latent factors: a controllable factor, which evolves deterministically within a small state space according to the agent's actions, and an exogenous factor, which represents time-correlated noise and can be highly complex. The goal of the representation learning problem is to learn an encoder that maps from observations into the controllable latent space, as well as the dynamics of this space. Efroni et al. (2022b) have shown that this is…
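The two-factor structure described in the abstract can be illustrated with a toy environment. The sketch below is a hypothetical example, not an environment or code from the paper: the class `ToyExBMDP`, the cyclic latent dynamics, the bit-flip noise model, and the `oracle_encoder` helper are all assumptions chosen only to make the definition concrete. The controllable state moves deterministically under the action, the exogenous bits evolve independently of the action with time correlation, and the observation combines both. Data is collected as one unbroken trajectory with no resets.

```python
import random


class ToyExBMDP:
    """Hypothetical toy Ex-BMDP (illustrative only, not from the paper)."""

    def __init__(self, num_latent=4, noise_bits=8, flip_prob=0.1, seed=0):
        self.num_latent = num_latent
        self.flip_prob = flip_prob
        self.rng = random.Random(seed)
        self.s = 0  # controllable latent state
        # Exogenous state: a vector of bits the agent cannot influence.
        self.e = [self.rng.randint(0, 1) for _ in range(noise_bits)]

    def latent_step(self, s, a):
        # Deterministic controllable dynamics on a cycle:
        # action 0 stays in place, action 1 advances by one state.
        return (s + a) % self.num_latent

    def step(self, a):
        self.s = self.latent_step(self.s, a)
        # Each noise bit flips with small probability, independent of the
        # action, so consecutive noise states are correlated in time.
        self.e = [b ^ int(self.rng.random() < self.flip_prob) for b in self.e]
        return self.observe()

    def observe(self):
        # Observation = one-hot of the controllable state + noise bits.
        one_hot = [int(i == self.s) for i in range(self.num_latent)]
        return one_hot + list(self.e)


def oracle_encoder(obs, num_latent):
    # Ground-truth encoder for this toy case: recover the controllable
    # state from the one-hot prefix and ignore the exogenous bits.
    return obs[:num_latent].index(1)


def collect_trajectory(env, num_steps, seed=0):
    # A single trajectory: the environment is never reset.
    rng = random.Random(seed)
    obs = env.observe()
    trajectory = []
    for _ in range(num_steps):
        a = rng.randint(0, 1)
        next_obs = env.step(a)
        trajectory.append((obs, a, next_obs))
        obs = next_obs
    return trajectory
```

In this toy case the encoder is known by construction. In the representation learning problem the abstract describes, the encoder and the latent dynamics are what must be learned from the trajectory.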
Peer Reviews
Decision: ICLR 2025 Poster
- The introduction and related work highlight this work really well. It explains the existing work nicely and shows where the gaps lie and how this work attempts to extend it.
- The algorithm stands out in terms of the settings it covers compared to existing work. It deals with infinite trajectories, partial observability, and optimization with function approximators, all while providing sample complexity guarantees.
- The algorithm itself is designed very well and has a lot of interesting featur
- Section 4 can be a bit hard to follow. To fully understand how the algorithm works, one has to switch between reading the section text, the pseudocode, and parts of the Appendix. I suggest moving the pseudocode to the appendix and providing further explanation of the algorithm in the main text, such that the reader can get a high-level idea of how the algorithm works from just reading Section 4.
- There are parts of the algorithm that are not very intuitive and might require some further
* The paper is clearly written, and the analysis of the key result, specifically the sample complexity of STEEL being polynomial in the latent space size, is supported by solid mathematical arguments. The algorithm's description is intuitive and effectively conveys its core concepts.
* Furthermore, representation learning from a single episode has been a long-standing interest in the RL community, making this paper's contribution highly relevant to the field.
* The paper provides a comprehens
* The method relies on several assumptions, particularly concerning the latent state space $\mathcal{S}$. For example, the assumptions of deterministic latent dynamics and the reachability condition of the latent state space are critical for STEEL's CycleFind to function. Addressing these assumptions seems non-trivial, and overcoming them is posed as future work.
* Although the sample complexity of STEEL is polynomial in the size of the latent state space, the numerical simulations show that a s
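The two assumptions named in the review above, deterministic latent dynamics and reachability, can be made concrete with a small check. The sketch below assumes one plausible reading of the reachability condition, namely that every latent state can be reached from every other latent state. This reading is an assumption, not a statement taken from the paper, and the function names and the dictionary representation of the dynamics are hypothetical.

```python
from collections import deque


def reachable_from(start, transitions, num_actions):
    # Breadth-first search over deterministic latent dynamics, given as
    # a dict mapping (state, action) -> next_state.
    seen = {start}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        for a in range(num_actions):
            nxt = transitions[(s, a)]
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def all_states_mutually_reachable(states, transitions, num_actions):
    # Assumed reading of "reachability": from any latent state, every
    # latent state can be reached by some action sequence.
    target = set(states)
    return all(
        reachable_from(s, transitions, num_actions) == target for s in states
    )


# Three states on a cycle: action 0 stays, action 1 advances.
cycle = {(s, a): (s + a) % 3 for s in range(3) for a in range(2)}

# A chain whose last state is absorbing, so earlier states are unreachable
# once the agent arrives there.
chain = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 2, (2, 0): 2, (2, 1): 2}
```

Under this reading, the cycle satisfies the condition and the chain does not. In a single-trajectory setting without resets, an agent that enters the chain's absorbing state can never revisit the other states.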
Taxonomy
Topics: Machine Learning and Algorithms · Algorithms and Data Compression · Speech Recognition and Synthesis
