Unsupervised Learning of Efficient Exploration: Pre-training Adaptive Policies via Self-Imposed Goals

Octavio Pappalardo

arXiv:2601.19810·cs.LG·January 28, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces ULEE, an unsupervised meta-learning approach that enhances exploration and adaptation in reinforcement learning agents by self-generating goals, leading to better zero-shot and few-shot performance in complex environments.

Contribution

The paper proposes ULEE, a novel unsupervised meta-learning method that combines goal generation and curriculum guidance to improve exploration and adaptation in diverse tasks.

Findings

01

ULEE improves zero-shot and few-shot performance.

02

Pre-training with ULEE enhances generalization to new objectives.

03

ULEE outperforms DIAYN and alternative curricula on the benchmarks.

Abstract

Unsupervised pre-training can equip reinforcement learning agents with prior knowledge and accelerate learning in downstream tasks. A promising direction, grounded in human development, investigates agents that learn by setting and pursuing their own goals. The core challenge lies in how to effectively generate, select, and learn from such goals. Our focus is on broad distributions of downstream tasks where solving every task zero-shot is infeasible. Such settings naturally arise when the target tasks lie outside of the pre-training distribution or when their identities are unknown to the agent. In this work, we (i) optimize for efficient multi-episode exploration and adaptation within a meta-learning framework, and (ii) guide the training curriculum with evolving estimates of the agent's post-adaptation performance. We present ULEE, an unsupervised meta-learning method that combines an…
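
The idea of "evolving estimates of the agent's post-adaptation performance" in point (ii) can be illustrated with a minimal sketch. The reviews below indicate that ULEE uses a learned difficulty predictor; the tabular exponential moving average here is a simpler stand-in, not the paper's method. The class name `PostAdaptationTracker`, the `decay` and `prior` parameters, and the goal labels are all hypothetical.

```python
from collections import defaultdict


class PostAdaptationTracker:
    """Running per-goal estimate of success after adaptation.

    Hypothetical helper: a tabular exponential moving average used as a
    stand-in for a learned difficulty predictor.
    """

    def __init__(self, decay=0.9, prior=0.5):
        self.decay = decay
        # Goals never seen before start at the prior estimate.
        self.estimates = defaultdict(lambda: prior)

    def update(self, goal, succeeded_after_adaptation):
        """Blend the newest post-adaptation outcome into the goal's estimate."""
        old = self.estimates[goal]
        outcome = float(succeeded_after_adaptation)
        new = self.decay * old + (1.0 - self.decay) * outcome
        self.estimates[goal] = new
        return new


tracker = PostAdaptationTracker(decay=0.5, prior=0.5)
tracker.update("goal_a", True)   # estimate rises toward 1
tracker.update("goal_a", True)
tracker.update("goal_b", False)  # estimate falls toward 0
```

A curriculum could then consult `tracker.estimates` when choosing which goals to propose next. How ULEE actually turns such estimates into a goal distribution is not described in this summary.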

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

The paper is well-motivated. The curriculum is based on performance after in-context adaptation, not immediate performance, which aligns with the intended meta-RL setting. Empirically, ULEE improves exploration, shows faster few-shot adaptation, and provides stronger initializations for finetuning.

Weaknesses

* The empirical impact of defining difficulty via post-adaptation performance, rather than immediate performance, remains unclear without an ablation. A direct ablation (e.g., a sensitivity study over $K$) would strengthen the paper.
* The baselines do not include recent meta RL and unsupervised RL methods.
* Experimental scope is limited to grid-world domains.
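
The distinction this reviewer draws between immediate and post-adaptation difficulty can be sketched as follows. The sketch assumes that $K$ denotes the number of adaptation episodes and that difficulty is read off as the success rate in the episode that follows them, so $K = 0$ recovers immediate performance. Both the function name `success_rate_after` and the outcome matrix are made up for illustration and are not results from the paper.

```python
import numpy as np


def success_rate_after(outcomes, k):
    """Mean success in the episode that follows k adaptation episodes.

    outcomes: array of shape [n_trials, n_episodes] with 1 for success.
    k = 0 corresponds to immediate (no-adaptation) performance.
    """
    outcomes = np.asarray(outcomes, dtype=float)
    if not 0 <= k < outcomes.shape[1]:
        raise ValueError("k must index an available episode")
    return float(outcomes[:, k].mean())


# Synthetic outcomes for one goal: 4 trials, 4 consecutive episodes each.
outcomes = np.array([
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 1],
])

# Sweep the adaptation budget, as a sensitivity study over K would.
rates = [success_rate_after(outcomes, k) for k in range(outcomes.shape[1])]
```

In this toy example a goal that looks hard under the immediate measure (`rates[0]`) looks easy once the agent has had three episodes to adapt (`rates[3]`), which is the kind of gap an ablation over $K$ would expose.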

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- Although this is not my area of expertise, the paper's motivation and positioning within existing literature appears strong.
- The problem of pre-training for adaptation is very interesting.
- The empirical results are very strong.

Weaknesses

- Section 4.3.1 does not sufficiently answer Q1. In this section the fraction of evaluation goals reached as a function of the number of evaluation episodes is shown in Figure 2. In my opinion this does not isolate exploration as the cause of evaluation goals reached, nor does it answer _"what exploration capabilities"_ the policy exhibits. For example, an increase in evaluation goals reached can also be due to zero-shot generalization, rather than improved exploration/adaptation.
- Some parts
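
The quantity discussed in the first point, the fraction of evaluation goals reached as a function of the number of evaluation episodes, can be computed as in the sketch below. It assumes each goal is summarized by the 1-indexed episode in which it was first reached (`None` if never); the function name and the data are hypothetical and do not come from Figure 2.

```python
def fraction_reached_by_episode(first_success_episode, n_episodes):
    """Cumulative fraction of goals reached within k episodes, for k = 1..n_episodes.

    first_success_episode: one entry per goal, the 1-indexed episode of the
    first success, or None if the goal was never reached.
    """
    n_goals = len(first_success_episode)
    curve = []
    for k in range(1, n_episodes + 1):
        reached = sum(
            1 for e in first_success_episode if e is not None and e <= k
        )
        curve.append(reached / n_goals)
    return curve


# Synthetic first-success episodes for 8 evaluation goals.
first = [1, 1, 2, 3, None, 2, 1, None]
curve = fraction_reached_by_episode(first, 4)

zero_shot = curve[0]                      # goals reached in the first episode
gain_from_later_episodes = curve[-1] - curve[0]
```

Reporting `zero_shot` separately from `gain_from_later_episodes` shows how much of the curve is already explained by first-episode success. On its own this split still does not identify which exploration behaviours produce the later gains.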

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The post-adaptation task-difficulty metric, for me, is novel. It significantly departs from prior works in automatic curricula, which typically evaluate goal difficulty based on the agent's immediate performance. By defining difficulty as the agent's expected success rate after an adaptation budget, the method directly optimizes for the agent's capacity to learn rather than just its current knowledge. This aligns the pre-training objective more closely.
2. The paper originally combines three

Weaknesses

1. The primary weakness of ULEE is its high methodological complexity. The system is not a single algorithm but a complex interplay of four distinct, learning-based components: the Pre-trained Policy ($\pi$), the Goal-search Policy ($\pi_{g, s}$), as well as the Difficulty Predictor. Are there practical bottlenecks (e.g., memory, wall-clock time) for this method?
2. The overall system's success depends on these components learning in lockstep. Can this co-adaptive process be brittle? E.g., will

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning