Learning The Minimum Action Distance

Lorenzo Steccanella; Joshua B. Evans; Özgür Şimşek; Anders Jonsson

arXiv:2506.09276·cs.LG·March 25, 2026

TL;DR

This paper introduces a self-supervised framework that learns the minimum action distance (MAD) between states of an MDP solely from state trajectories, supporting goal-conditioned tasks without reward signals.

Contribution

It proposes a novel method to learn state representations based on MAD, capturing environment structure without requiring rewards or actions, applicable to various dynamics and observation noise.

Findings

01. Efficiently learns accurate MAD representations across diverse environments.

02. Outperforms existing state representation methods in quality.

03. Works with deterministic, stochastic, discrete, and continuous settings.

Abstract

This paper presents a state representation framework for Markov decision processes (MDPs) that can be learned solely from state trajectories, requiring neither reward signals nor the actions executed by the agent. We propose learning the minimum action distance (MAD), defined as the minimum number of actions required to transition between states, as a fundamental metric that captures the underlying structure of an environment. MAD naturally enables critical downstream tasks such as goal-conditioned reinforcement learning and reward shaping by providing a dense, geometrically meaningful measure of progress. Our self-supervised learning approach constructs an embedding space where the distances between embedded state pairs correspond to their MAD, accommodating both symmetric and asymmetric approximations. We evaluate the framework on a comprehensive suite of environments with known MAD…
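The definition of MAD in the abstract can be made concrete on a toy example. The following is a minimal sketch, not the paper's learning method: it assumes deterministic transitions and uses a hypothetical four-state graph (`GRAPH`) with hypothetical helper names. It computes the exact MAD by breadth-first search, shows that the distance can be asymmetric, and shows that a state trajectory alone already yields upper bounds on MAD, since reaching one state from another in k observed steps implies the MAD between them is at most k.

```python
from collections import deque


def mad_from(graph, source):
    """Breadth-first search: minimum number of actions from `source` to each reachable state."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        state = queue.popleft()
        for nxt in graph.get(state, ()):
            if nxt not in dist:
                dist[nxt] = dist[state] + 1
                queue.append(nxt)
    return dist


def trajectory_upper_bounds(trajectory):
    """For i < j, the trajectory reaches s_j from s_i in j - i steps, so MAD(s_i, s_j) <= j - i."""
    bounds = {}
    for i, start in enumerate(trajectory):
        for j in range(i + 1, len(trajectory)):
            pair = (start, trajectory[j])
            steps = j - i
            # Keep the tightest bound seen for this ordered pair.
            bounds[pair] = min(bounds.get(pair, steps), steps)
    return bounds


# Hypothetical toy MDP (an assumption for illustration): each edge is one action.
# A -> B -> C -> D -> A, plus a one-way shortcut A -> C.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["D"], "D": ["A"]}

mad_A = mad_from(GRAPH, "A")
mad_C = mad_from(GRAPH, "C")

# Asymmetry: one action from A to C, but two actions from C back to A.
print("MAD(A, C) =", mad_A["C"])
print("MAD(C, A) =", mad_C["A"])

# A trajectory that does not take the shortcut only gives a loose bound for (A, C).
bounds = trajectory_upper_bounds(["A", "B", "C", "D"])
print("trajectory bound for (A, C):", bounds[("A", "C")])
```

In this sketch the trajectory bound for (A, C) is 2 while the true MAD is 1, which illustrates why trajectory step counts act as upper bounds rather than as exact targets, and why an asymmetric (quasi-metric) distance is needed when transitions are not reversible.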

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

The work is largely well written and does a good job of explaining the choice of quasi-metric. The authors also show that the way they define the minimum action distance, $d_{MAD}$, leads to a unique function. They benchmark against QRL.

Weaknesses

I believe the main weakness is that the empirical evaluation is limited and, for the evaluation that is presented, I don't see a significant gain from this added complexity in the algorithm. In the QRL [1] paper they benchmark their results on continuous control environments and also massive 2D mazes. Here the authors have benchmarked the results on PointMaze, OGBench PointMaze, CliffWalking, KeyDoorGridWorld, and NoisyGridWorld. While I see the value in making these choices and thank

Reviewer 02 · Rating 2 · Confidence 4

Strengths

See below.

Weaknesses

The paper states that it tries to learn a distance function between pairs of states that can later be used by an RL agent to learn more efficiently. I believe, then, that the paper tries to learn some sort of model of the environment which can be used by the RL agent. However, there is no mention of model-based reinforcement learning or even a comparison against methods in model-based RL. If the goal of the paper was to learn a model of the environment, why not compare against these methods and provide i

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The definition of MAD in equation (1) conforms to the fundamental property of a type of distance.
2. It is an interesting idea to define a new distance between the states by the collected trajectories.
3. When it is hard to get the reward signals, the new framework can be seen as unsupervised or semi-supervised reinforcement learning to optimize the policy.

Weaknesses

1. The source code of the new algorithm is not open-sourced.
2. The algorithm for MAD and the time complexity analysis should be given.
3. The experimental environments are simple and monotonous. All of them are maze problems.

Taxonomy

Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Autonomous Vehicle Technology and Safety