Multi-level meta-reinforcement learning with skill-based curriculum
Sichen Yang (Johns Hopkins University), Mauro Maggioni (Johns Hopkins University)

TL;DR
This paper introduces a multi-level meta-reinforcement learning framework that compresses hierarchical MDPs, enabling efficient policy learning, transfer of skills across tasks, and curriculum-based training for complex decision-making problems.
Contribution
It proposes an efficient hierarchical MDP compression method that preserves semantics, reduces complexity, and facilitates skill transfer and curriculum learning in reinforcement learning.
Findings
Hierarchical MDP compression reduces computational complexity.
Skills can be transferred across different tasks and levels.
Curriculum learning enhances transferability and learning efficiency.
Abstract
We consider problems in sequential decision making with a natural multi-level structure, where sub-tasks are assembled to accomplish complex goals. Systematically inferring and leveraging hierarchical structure has remained a longstanding challenge. We describe an efficient multi-level procedure for repeatedly compressing Markov decision processes (MDPs), wherein each member of a parametric family of policies at one level is treated as a single action in the compressed MDPs at higher levels. The procedure preserves the semantic meaning and structure of the original MDP and mimics the natural way one would approach a complex MDP. Higher-level MDPs are themselves independent MDPs with less stochasticity, and may be solved using existing algorithms. As a byproduct, spatial or temporal scales may be coarsened at higher levels, making it more efficient to find long-term optimal policies. The multi-level…
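The abstract's central step, turning a lower-level policy into a single action of a higher-level MDP and then solving that higher-level MDP with an existing algorithm, can be sketched for a small tabular MDP. This is a minimal illustration under assumptions not taken from the paper: it uses the standard multi-time model from the options framework (Sutton, Precup and Singh, 1999), uses one deterministic policy per option rather than a parametric family, and assumes an option stops as soon as it reaches a designated set of "boundary" states, which then serve as the states of the compressed MDP. The helpers `compress_option` and `high_level_value_iteration` and the corridor example are hypothetical; the paper's own compression may define compressed transitions and rewards differently.

```python
import numpy as np


def compress_option(P, R, policy, terminal, gamma):
    """Multi-time model of one option (hypothetical helper, not the paper's code).

    P:        (A, S, S) array, P[a, s, s2] = Pr(s2 | s, a)
    R:        (S, A) array of expected one-step rewards
    policy:   (S,) int array, deterministic low-level policy pi(s)
    terminal: (S,) bool array; the option stops on arriving in these states
    Returns H, g with
      H[s, s2] = sum_k gamma^k Pr(option ends in s2 after k steps | start in s)
      g[s]     = expected discounted reward collected until the option ends.
    """
    S = P.shape[1]
    idx = np.arange(S)
    P_pi = P[policy, idx, :]             # row s follows action policy[s]
    r_pi = R[idx, policy]
    N = ~terminal
    Q = P_pi[np.ix_(N, N)]               # non-terminal -> non-terminal
    B = P_pi[np.ix_(N, terminal)]        # non-terminal -> terminal
    lhs = np.eye(N.sum()) - gamma * Q    # invertible whenever gamma < 1
    h_N = gamma * np.linalg.solve(lhs, B)  # discounted exit distribution from N
    g_N = np.linalg.solve(lhs, r_pi[N])    # discounted reward until exit from N
    # The first primitive step is always taken, so every start state uses
    # "one step, then continue from a non-terminal state if not yet stopped".
    H = np.zeros((S, S))
    H[:, terminal] = gamma * (P_pi[:, terminal] + P_pi[:, N] @ h_N)
    g = r_pi + gamma * P_pi[:, N] @ g_N
    return H, g


def high_level_value_iteration(models, high_states, iters=500):
    """Value iteration on the compressed MDP whose actions are options.

    models: list of (H, g) pairs from compress_option, one per option.
    Only the high-level states (where options end) carry values.
    """
    V = np.zeros(models[0][1].shape[0])
    for _ in range(iters):
        Q = np.stack([g + H @ V for H, g in models])  # (num_options, S)
        V[high_states] = Q[:, high_states].max(axis=0)
    return V, Q[:, high_states].argmax(axis=0)


# Toy corridor 0..4 (hypothetical): state 4 is an absorbing goal with reward 0,
# every other primitive step costs -1; states 0 and 4 are the "doorways".
S, LEFT, RIGHT = 5, 0, 1
P = np.zeros((2, S, S))
for s in range(S - 1):
    P[LEFT, s, max(s - 1, 0)] = 1.0
    P[RIGHT, s, s + 1] = 1.0
P[:, S - 1, S - 1] = 1.0                 # goal is absorbing
R = -np.ones((S, 2))
R[S - 1, :] = 0.0
terminal = np.zeros(S, dtype=bool)
terminal[[0, S - 1]] = True

H_left, g_left = compress_option(P, R, np.full(S, LEFT), terminal, 0.9)
H_right, g_right = compress_option(P, R, np.full(S, RIGHT), terminal, 0.9)
V, greedy = high_level_value_iteration(
    [(H_left, g_left), (H_right, g_right)], np.flatnonzero(terminal))
print(H_right[1, 4], g_right[1])   # approximately 0.729 and -2.71
print(V[0], greedy[0])             # approximately -3.439 and 1 (go right)
```

Because `H` already folds in the discount for the option's duration, ordinary value iteration on the two-state compressed MDP finds that walking right from the left doorway is optimal, without reasoning about individual primitive steps.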
Peer Reviews
Decision · Submitted to ICLR 2026
The work identifies a real challenge that is of interest to the community (hierarchical compositionality and efficient reuse of subskills). Also, their integration of multi-level compression and skill-based curricula could, if formalised, provide an elegant lens on abstraction in RL.
* **Clarity and Writing Quality:** The paper is extremely poorly written and formatted, with unclear notation and undefined terms. E.g.:
  - Use of $\mathcal{A}\mathcal{S}$ instead of $\mathcal{A} \times \mathcal{S}$ throughout.
  - Undefined $S_{1:\tau}$, $A_{0:\tau-1}$, and $R_{0,\tau}$ in the value function definition. Also, the value function is only defined for initial states (line 106).
  - The precise definition of difficulty levels is not given, but it is used in statements like "MDP of d
The paper aims to address an important problem - improving the robust discovery of high-level skills in an environment. I agree that there is room for improvement on this line of work and if the claims of the paper are taken literally then the work does stand to be impactful and will lead to subsequent work. Given my concerns on clarity I am not able to fairly assess the originality or quality of the work and will aim to work with the authors during the discussion period to flesh out this porti
## Clarity
The notation of this work is unclear and inconsistent. It is not clear whether this is trying to convey subtleties in the formalism or just presenting things poorly. The bottom paragraph of page 2 serves as one example of this, with the sentence running from line 99 to 101 ("Given an active ... See App. B.1 for detailed definitions") being particularly unhelpful and confusing. This unfortunately undermines the entire work. The structure of the paper is also really unhelpful. The use of t
Strengths: **Originality:** The originality is moderate, as most of the underlying ideas, such as hierarchical abstractions, curriculum learning, and skill reuse, have been explored in prior literature. Nevertheless, the authors demonstrate good awareness of related work, providing a comprehensive contextualization, which is a positive aspect. **Quality:** Overall, the paper is well written and conceptually sound. However, it is somewhat difficult to follow due to the interleaving of methodolo
Weaknesses: **Limited empirical validation and lack of baselines**: The experimental evaluation is restricted to simple, discrete grid-world environments. While these setups illustrate the concepts clearly, they do not demonstrate scalability or generalization to more realistic domains. The paper does not provide baseline comparisons against established hierarchical or curriculum-based RL methods. **Ambiguous algorithmic implementation**: The paper lacks sufficient detail about how it can be
Taxonomy
Topics: Reinforcement Learning in Robotics · Machine Learning and Algorithms · Bayesian Modeling and Causal Inference
