Learning The Minimum Action Distance
Lorenzo Steccanella, Joshua B. Evans, Özgür Şimşek, Anders Jonsson

TL;DR
This paper introduces a self-supervised framework for learning the minimum action distance (MAD) between states in MDPs solely from state trajectories, which supports goal-conditioned reinforcement learning and reward shaping without reward signals.
Contribution
It proposes a novel method for learning state representations based on MAD. The representations capture environment structure without requiring rewards or actions, and the method applies across different kinds of dynamics and under observation noise.
Findings
Efficiently learns accurate MAD representations across diverse environments.
Outperforms existing state representation methods in quality.
Works with deterministic, stochastic, discrete, and continuous settings.
Abstract
This paper presents a state representation framework for Markov decision processes (MDPs) that can be learned solely from state trajectories, requiring neither reward signals nor the actions executed by the agent. We propose learning the minimum action distance (MAD), defined as the minimum number of actions required to transition between states, as a fundamental metric that captures the underlying structure of an environment. MAD naturally enables critical downstream tasks such as goal-conditioned reinforcement learning and reward shaping by providing a dense, geometrically meaningful measure of progress. Our self-supervised learning approach constructs an embedding space where the distances between embedded state pairs correspond to their MAD, accommodating both symmetric and asymmetric approximations. We evaluate the framework on a comprehensive suite of environments with known MAD…
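To make the definition concrete, the following is a minimal sketch, not the authors' method. It assumes a small, hypothetical environment given as a directed graph of one-step transitions, and computes the MAD by breadth-first search. It also shows one way state-only trajectories carry information about MAD: if a state appears at position i and another at position j ≥ i in the same trajectory, then j − i actions were enough to get from one to the other, so j − i is an upper bound on their MAD. The names `minimum_action_distance`, `trajectory_upper_bounds`, and the four-state graph are illustrative assumptions.

```python
from collections import deque


def minimum_action_distance(transitions, source):
    """Breadth-first search from `source`.

    `transitions[s]` lists the states reachable from `s` with one action
    (a hypothetical, fully known transition graph). Returns a dict mapping
    each reachable state to the minimum number of actions needed.
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:
        state = queue.popleft()
        for nxt in transitions.get(state, ()):
            if nxt not in dist:
                dist[nxt] = dist[state] + 1
                queue.append(nxt)
    return dist


def trajectory_upper_bounds(trajectory):
    """Smallest observed index gap for every ordered state pair.

    If s occurs at step i and s' at step j >= i, then j - i actions
    sufficed to reach s' from s, so j - i upper-bounds MAD(s, s').
    """
    bounds = {}
    for i, state in enumerate(trajectory):
        for j in range(i, len(trajectory)):
            pair = (state, trajectory[j])
            gap = j - i
            if gap < bounds.get(pair, float("inf")):
                bounds[pair] = gap
    return bounds


# Hypothetical environment: B -> C is one-way, so the distance is asymmetric.
transitions = {"A": ["B"], "B": ["A", "C"], "C": ["D"], "D": ["C"]}

mad_from_a = minimum_action_distance(transitions, "A")
mad_from_c = minimum_action_distance(transitions, "C")

# A state-only trajectory consistent with the graph above (no actions, no rewards).
trajectory = ["A", "B", "A", "B", "C", "D"]
bounds = trajectory_upper_bounds(trajectory)
```

In this toy graph, C is reachable from A but A is not reachable from C, which illustrates why the abstract mentions both symmetric and asymmetric approximations. The trajectory bounds never fall below the true MAD; a learned embedding would need some additional objective to tighten them, which this sketch does not attempt.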
Peer Reviews
Decision: Submitted to ICLR 2026
The work is largely well written and does a good job of explaining the choice of quasi-metric. The authors also show that the way they define the minimum action distance, $d_{MAD}$, leads to a unique function. They benchmark against QRL.
I believe the main weakness is that the empirical evaluation is limited, and in the evaluation that is presented I don't see a significant gain from the added complexity in the algorithm. In the QRL [1] paper they benchmark their results on continuous control environments and also massive 2D mazes. Here the authors have benchmarked the results on PointMaze, OGBench PointMaze, CliffWalking, KeyDoorGridWorld, and NoisyGridWorld. While I see the value in making these choices and thank
See below.
The paper states that it tries to learn a distance function between pairs of states that can later be used by an RL agent to learn more efficiently. I believe then the paper tries to learn some sort of model of the environment which can be used by the RL agent. However, there is no mention of model based reinforcement learning or even comparison against methods in model based RL. If the goal of the paper was to learn a model of the environment, why not compare against these methods and provide i
1. The definition of MAD in equation (1) conforms to the fundamental property of a type of distance. 2. It is an interesting idea to define a new distance between the states by the collected trajectories. 3. When it is hard to get the reward signals, the new framework can be seen as unsupervised or semi-supervised reinforcement learning to optimize the policy.
1. The source code of the new algorithm is not open-sourced. 2. The algorithm for MAD and the time complexity analysis should be given. 3. The experimental environments are simple and monotonous. All of them are maze problems.
Taxonomy
Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Autonomous Vehicle Technology and Safety
