Skill Learning via Policy Diversity Yields Identifiable Representations for Reinforcement Learning

Patrik Reizinger; Bálint Mucsányi; Siyuan Guo; Benjamin Eysenbach; Bernhard Schölkopf; Wieland Brendel

arXiv:2507.14748·cs.LG·July 22, 2025

TL;DR

This paper proves that the Contrastive Successor Features (CSF) method recovers the environment's ground-truth features up to a linear transformation, giving the first identifiability guarantee for representation learning in reinforcement learning, and validates the result empirically for feature learning from states and pixels.

Contribution

It offers the first theoretical proof that CSF can recover ground-truth features up to a linear transformation in RL, linking mutual information objectives to identifiable representations.
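
For readers new to the term, recovery "up to a linear transformation" is usually formalized as below. This is a sketch in notation chosen here for illustration: $\phi$ for the ground-truth feature map, $\hat{\phi}$ for the learned one, $A$ for the unknown matrix, and $d$ for the feature dimension are assumptions, not necessarily the paper's symbols. Whether $A$ must be invertible or only full rank depends on the exact theorem statement, which this page does not reproduce.

```latex
\[
  \hat{\phi}(s) = A\,\phi(s) \quad \text{for all states } s,
  \qquad A \in \mathbb{R}^{d \times d} \text{ invertible.}
\]
```

Under this reading, the learned features carry the same information as the ground-truth ones, and a single linear map fitted after training suffices to convert between them.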

Findings

01

CSF provably recovers ground-truth features from states.

02

Empirical validation in MuJoCo and DeepMind Control environments.

03

Analysis of mutual information objectives and entropy regularizers.

Abstract

Self-supervised feature learning and pretraining methods in reinforcement learning (RL) often rely on information-theoretic principles, termed mutual information skill learning (MISL). These methods aim to learn a representation of the environment while also incentivizing exploration thereof. However, the role of the representation and mutual information parametrization in MISL is not yet well understood theoretically. Our work investigates MISL through the lens of identifiable representation learning by focusing on the Contrastive Successor Features (CSF) method. We prove that CSF recovers the environment's ground-truth features up to a linear transformation due to the inner product parametrization of the features and skill diversity in a discriminative sense. This first identifiability guarantee for representation learning in RL also helps explain the implications of…
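
A claim of recovery up to a linear transformation is commonly checked by fitting a linear (here affine) map from learned features to ground-truth features and reporting the coefficient of determination. The sketch below illustrates that generic check on synthetic data. It is not the paper's evaluation code: the function `linear_r2`, the synthetic features, the mixing matrix `A`, and the noise level are all hypothetical choices made for this example.

```python
import numpy as np


def linear_r2(learned, truth):
    """R^2 of the least-squares affine map from learned to ground-truth features."""
    # Append a bias column so the fitted map is affine rather than strictly linear.
    X = np.hstack([learned, np.ones((learned.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(X, truth, rcond=None)
    residual = truth - X @ coef
    ss_res = np.sum(residual ** 2)
    ss_tot = np.sum((truth - truth.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(0)

# Hypothetical ground-truth features: 2000 samples, 4 dimensions.
truth = rng.normal(size=(2000, 4))

# Hypothetical well-conditioned mixing matrix (orthogonal basis times fixed scales).
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
A = Q @ np.diag([0.5, 1.0, 2.0, 3.0])

# Case 1: learned features are a linear mix of the truth plus small noise.
learned_linear = truth @ A.T + 0.01 * rng.normal(size=truth.shape)

# Case 2: learned features are a strongly nonlinear function of the truth.
learned_nonlinear = np.sin(4.0 * truth)

r2_linear = linear_r2(learned_linear, truth)
r2_nonlinear = linear_r2(learned_nonlinear, truth)

print(f"R^2, linearly mixed features:   {r2_linear:.4f}")
print(f"R^2, nonlinearly warped features: {r2_nonlinear:.4f}")
```

A score near 1 in the first case is what linear identifiability predicts. The second case shows that the score drops when the learned features are related to the truth only through a nonlinear map, which is why the metric is informative.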

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 2

Strengths

- The paper makes a novel connection between MISL methods and non-linear ICA.
- The paper proves identifiability results for CSF, providing insight into why the method succeeds and why certain design choices are better.
- The paper provides empirical results to back up the theoretical insights in a number of environments.
- The paper opens up a new direction of research for understanding self-supervised RL methods, and can be of wide interest to the RL community.

Weaknesses

- The paper is dense and can sometimes be hard to follow.
  - For example, in Section 2.3, where the paper draws a connection between MISL and DGP, it was initially unclear to me how skills fit into the picture. It was mentioned in Section 2.1 that skills can be viewed as auxiliary variables, which can be brought up here again to aid explanation.
- Perhaps due to a limitation in space, there's almost no spacing between some paragraphs.
- There can be more discussion on the technical details of

Reviewer 02 · Rating 8 · Confidence 4

Strengths

This work applies an elegant description of identifiability in a novel context. The theoretical framework is well articulated and provides clear reasons advantages. The empirical results are sufficient to support the theoretical claims.

Weaknesses

The empirical results are somewhat limited in scope, considering the extension to the POMDP setting. The work is not particularly self-contained, in that many of the claims are fully described only in the appendix. CSF is not the most representative MISL algorithm because it detaches the representation learning from the policy learning more than most methods.

Reviewer 03 · Rating 4 · Confidence 4

Strengths

This paper is well motivated and well written. Its objectives are clear and, to my knowledge, provide the first analysis of ground truth identifiability using mutual information skill learning (MISL) losses. The paper introduces the notion that a set of diverse skills and an inner product parameterization are necessary for learning a robust representation that provably recovers the ground truth state.

Weaknesses

There are several weaknesses in the paper that must be addressed.

## Major Weaknesses

1. **Reality of assumptions**: It is not clear that the assumptions made in the paper are representative of reality. Namely, is it common that "each state difference is equiprobable"? What is the support for this claim?
2. **Transitions are typically not skills**: The authors also assume that "each pair of consecutive states is a skill". I believe that this is not a typical definition o
