TRACED: Transition-aware Regret Approximation with Co-learnability for Environment Design

Geonwoo Cho; Jaegyun Im; Jihwan Lee; Hojun Yi; Sejin Kim; Sundong Kim

arXiv:2506.19997·cs.LG·March 17, 2026

TL;DR

TRACED introduces a novel approach to environment curriculum design in reinforcement learning by combining transition-prediction error and Co-Learnability to improve zero-shot generalization and sample efficiency.

Contribution

It proposes a new regret approximation method that incorporates transition-prediction error and Co-Learnability for more effective environment curriculum generation in UED.

Findings

1. TRACED outperforms strong baselines on multiple benchmarks.
2. Transition-prediction error accelerates complexity ramp-up.
3. Co-Learnability provides additional gains when combined with transition-prediction error.

Abstract

Generalizing deep reinforcement learning agents to unseen environments remains a significant challenge. One promising solution is Unsupervised Environment Design (UED), a co-evolutionary framework in which a teacher adaptively generates tasks with high learning potential, while a student learns a robust policy from this evolving curriculum. Existing UED methods typically measure learning potential via regret, the gap between optimal and current performance, approximated solely by value-function loss. Building on these approaches, we introduce the transition-prediction error as an additional term in our regret approximation. To capture how training on one task affects performance on others, we further propose a lightweight metric called Co-Learnability. By combining these two measures, we present Transition-aware Regret Approximation with Co-learnability for Environment Design (TRACED).…
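The scoring idea in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the exact formulas, so the additive form of the regret proxy, the weights `alpha` and `beta`, the definition of co-learnability as the mean difficulty drop on other tasks, and all function names and toy numbers are assumptions.

```python
from typing import Dict


def regret_proxy(value_loss: float, transition_error: float, alpha: float = 1.0) -> float:
    """Hypothetical regret proxy: value-function loss plus a weighted
    transition-prediction error (additive form is an assumption)."""
    return value_loss + alpha * transition_error


def co_learnability(before: Dict[str, float], after: Dict[str, float], trained_task: str) -> float:
    """Assumed definition: mean drop in the difficulty of the *other* tasks
    after one round of training on `trained_task`."""
    others = [t for t in before if t != trained_task and t in after]
    if not others:
        return 0.0
    return sum(before[t] - after[t] for t in others) / len(others)


def task_score(difficulty: float, co_learn: float, beta: float = 1.0) -> float:
    """Hypothetical priority: task difficulty plus weighted co-learnability."""
    return difficulty + beta * co_learn


# Toy difficulties (regret proxies) before and after training on task "A".
before = {"A": 0.9, "B": 0.6, "C": 0.4}
after = {"A": 0.7, "B": 0.5, "C": 0.4}

difficulty_a = regret_proxy(value_loss=0.5, transition_error=0.2, alpha=0.5)
co_learn_a = co_learnability(before, after, trained_task="A")
score_a = task_score(difficulty_a, co_learn_a, beta=2.0)
```

In this toy setup, training on task "A" lowers the difficulty of task "B" but not of task "C", so "A" receives a small positive co-learnability bonus on top of its own difficulty.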

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. Overall, the writing is easy to follow and straightforward.
2. The transition-aware regret approximation is well motivated.
3. The notation is simple, and the definitions (Section 3.2) are easy to follow.
4. The figures are mainly well formatted and readable.
5. The authors provide fair computational comparisons and a wall-clock time comparison with most existing methods.

Weaknesses

1. The authors do not rely on novel analysis tools as proposed by SFL [1], which would allow the authors to strengthen their claim that TRACED improves performance over existing methods. For example, the authors could have analysed CVaR to show robustness to worst-case levels. Similarly, SFL provides density maps, such as those in Figures 4 and 6 of the SFL paper, which would strengthen the claim of TRACED that they improve performance. It is also unclear to the reviewer why they do not compare

Reviewer 02 · Rating 6 · Confidence 3

Strengths

By introducing an intuitive transition-prediction loss and co-learnability metric, the method yields strong empirical gains with inexpensive changes.

Weaknesses

* While co-learnability is interesting and intuitive, the paper lacks theoretical analysis or guarantees.
* Despite its stochastic dynamics formulation, ATPL is computed based on deterministic predictions. For state–action pairs with high aleatoric stochasticity, the next state is inherently unpredictable, so ATPL can be large even when the future-value gap is negligible.
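The second point can be illustrated with a small numerical sketch. It assumes a squared-error prediction loss and a scalar next state, neither of which the review specifies, and the function names are hypothetical. Under those assumptions, even the best deterministic prediction (the mean next state) incurs an expected error equal to the variance of the next state, so the loss stays large under purely aleatoric noise.

```python
from typing import Sequence


def expected_squared_error(prediction: float,
                           outcomes: Sequence[float],
                           probs: Sequence[float]) -> float:
    """Expected squared error of one deterministic prediction against a
    discrete next-state distribution."""
    return sum(p * (s - prediction) ** 2 for s, p in zip(outcomes, probs))


def best_deterministic_prediction(outcomes: Sequence[float],
                                  probs: Sequence[float]) -> float:
    """The mean minimizes expected squared error among point predictions."""
    return sum(p * s for s, p in zip(outcomes, probs))


# Stochastic transition: the next state is -1 or +1 with equal probability.
noisy_outcomes, noisy_probs = [-1.0, 1.0], [0.5, 0.5]
best = best_deterministic_prediction(noisy_outcomes, noisy_probs)
noise_floor = expected_squared_error(best, noisy_outcomes, noisy_probs)

# Deterministic transition: the next state is always +1.
det_outcomes, det_probs = [1.0], [1.0]
det_error = expected_squared_error(
    best_deterministic_prediction(det_outcomes, det_probs), det_outcomes, det_probs
)
```

Here `noise_floor` equals the variance of the noisy transition and cannot be reduced by a better point predictor, while `det_error` is zero, which matches the reviewer's concern that prediction error alone does not separate irreducible noise from a genuine learning gap.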

Reviewer 03 · Rating 6 · Confidence 4

Strengths

The paper is generally well written and provides a solid review of related work. Proposes a novel and computationally efficient curriculum strategy for UED that achieves competitive performance without additional overhead. Includes systematic empirical evaluation, with detailed ablation studies and reporting of wall-clock time.

Weaknesses

Clarity of Section 3.2 could be improved:
- co-learnability term in Eq. (7) requires one-step lookahead in the curriculum
- task-difficulty $(i, t)$ defined as the regret approximation of the task $i$ when last sampled before time $t$
- co-learnability $(i, t)$ computed when task $i$ was last drawn before time $t-1$
- Please clarify why the rank transform prevents outliers from dominating the sampling distribution.

Missing discussion of ZPD-based curriculum strategies (e.g., Florensa et al., 2
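On the rank-transform question raised above, a minimal sketch may help. It assumes the 1/rank weighting with a temperature used by rank prioritization in Prioritized Level Replay; whether TRACED uses exactly this form is an assumption, and the function names and scores are illustrative only.

```python
from typing import List, Sequence


def proportional_probs(scores: Sequence[float]) -> List[float]:
    """Sample tasks in direct proportion to their raw scores."""
    total = sum(scores)
    return [s / total for s in scores]


def rank_probs(scores: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Sample tasks by rank: weight = (1 / rank) ** (1 / temperature),
    where rank 1 is the highest score (assumed PLR-style weighting)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    weights = [(1.0 / r) ** (1.0 / temperature) for r in ranks]
    total = sum(weights)
    return [w / total for w in weights]


# One task has an outlier score; the other two are ordinary.
outlier_scores = [1000.0, 2.0, 1.0]
mild_scores = [10.0, 2.0, 1.0]

p_prop = proportional_probs(outlier_scores)
p_rank = rank_probs(outlier_scores)
p_rank_mild = rank_probs(mild_scores)
```

Because only the ordering enters the rank weights, `p_rank` is identical for the outlier and the mild scores, whereas proportional sampling gives the outlier task nearly all of the probability mass.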

Code & Models

Repositories

cho-geonwoo/traced
pytorch · Official

Taxonomy

Topics: Computer Graphics and Visualization Techniques · Neural Networks and Applications · Generative Adversarial Networks and Image Synthesis