Is Optimal Transport Necessary for Inverse Reinforcement Learning?

Zixuan Dong; Yumi Omori; Keith Ross

arXiv:2506.06793·cs.LG·June 10, 2025

TL;DR

This paper questions the necessity of optimal transport in inverse reinforcement learning, proposing simple, heuristic alternatives that match or outperform OT-based methods across various benchmarks, emphasizing simplicity and efficiency.

Contribution

The authors introduce two heuristic IRL methods that bypass optimal transport, demonstrating their effectiveness and efficiency through extensive empirical evaluation.

Findings

1. Heuristic methods match or outperform OT-based approaches.

2. Simple proximity-based rewards are effective for IRL.

3. Complex OT optimization may be unnecessary for reward inference.

Abstract

Inverse Reinforcement Learning (IRL) aims to recover a reward function from expert demonstrations. Recently, Optimal Transport (OT) methods have been successfully deployed to align trajectories and infer rewards. While OT-based methods have shown strong empirical results, they introduce algorithmic complexity and hyperparameter sensitivity, and they require solving OT optimization problems. In this work, we challenge the necessity of OT in IRL by proposing two simple, heuristic alternatives: (1) Minimum-Distance Reward, which assigns rewards based on the nearest expert state regardless of temporal order; and (2) Segment-Matching Reward, which incorporates lightweight temporal alignment by matching agent states to corresponding segments in the expert trajectory. These methods avoid optimization, exhibit linear-time complexity, and are easy to implement. Through extensive evaluations across…
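The two heuristics named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything concrete here is an assumption made for illustration: the function names, the use of Euclidean distance, the choice of negative distance as the reward, and the rule that splits the expert trajectory into one contiguous segment per agent time step. It should not be read as the authors' implementation.

```python
import numpy as np


def min_distance_reward(agent_states, expert_states):
    """Reward each agent state by its (negative) distance to the nearest
    expert state, ignoring temporal order. Reward form is an assumption."""
    agent = np.asarray(agent_states, dtype=float)
    expert = np.asarray(expert_states, dtype=float)
    # Pairwise Euclidean distances, shape (T_agent, T_expert).
    dists = np.linalg.norm(agent[:, None, :] - expert[None, :, :], axis=-1)
    return -dists.min(axis=1)


def segment_matching_reward(agent_states, expert_states):
    """Reward agent state t by its (negative) distance to the nearest expert
    state inside the t-th contiguous segment of the expert trajectory.
    The equal-split segmentation rule is an assumption."""
    agent = np.asarray(agent_states, dtype=float)
    expert = np.asarray(expert_states, dtype=float)
    T, N = len(agent), len(expert)
    rewards = np.empty(T)
    for t in range(T):
        # Segment t covers expert indices [start, end); at least one state.
        start = (t * N) // T
        end = max(((t + 1) * N) // T, start + 1)
        seg = expert[start:end]
        rewards[t] = -np.linalg.norm(seg - agent[t], axis=-1).min()
    return rewards


if __name__ == "__main__":
    expert = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
    reversed_agent = expert[::-1]
    # Visits the right states in the wrong order: only the segment-based
    # reward penalizes this.
    print(min_distance_reward(reversed_agent, expert))
    print(segment_matching_reward(reversed_agent, expert))
```

In this toy case an agent that replays the expert states in reverse gets a perfect score from the order-agnostic reward but is penalized by the segment-based one, which is the kind of temporal alignment the abstract attributes to Segment-Matching.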

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. This work proposes two optimization-free reward surrogates (Min-Dist, Seg-Match) with clear computational advantages; 2. It provides a practical complexity and timing comparison for reward labeling; 3. The empirical study shows that simple distance-based surrogates are competitive or better on many standard setups.

Weaknesses

1. The paper does not test harder cases where OT variants are known to be useful (e.g., large time warps, cross-domain state geometry, partial observability, goal shifts). As the authors themselves highlight, outcomes are highly sensitive to downstream RL hyperparameters (γ, BC regularizations). Hence, the claim “OT is unnecessary” should be re-scoped to the studied regimes. The paper already shows tuning can flip results (e.g., Seg-Match collapse under untuned regularization vs. recovery after

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. Simplicity of the method. The method itself does not require complex algorithms and is purely based on trajectory alignment via distance. 2. Temporal alignment. The method by design includes the property of temporal alignment, allowing trajectories to encode context information, which is crucial for obtaining a reward aligned with the goal. 3. Multiple datasets and mixture of experts evaluation. The authors evaluated their algorithm across multiple datasets and demonstrated its performance in

Weaknesses

1. **Incremental contribution:** The proposed method appears to offer only a modest extension of existing approaches and does not achieve state-of-the-art performance. This raises questions regarding the practical significance and broader applicability of the method. 2. **Incomplete comparison with prior work:** The authors claim superior performance in most cases; however, they do not include a comparison with [1], which also leverages a single expert trajectory and consistently achieves bette

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1) The motivation is clear, and the paper is well-written and easy to follow. 2) The proposed approaches are conceptually simple and perform competitively with more complex adversarial OT methods.

Weaknesses

1) The empirical support for the proposed methods is inconsistent across benchmarks. On MuJoCo, Segment-Matching performs best, while on AntMaze, Minimum-Distance is superior. This variability limits the methods' applicability to more complex benchmarks, as it necessitates training and comparing both approaches to determine the best one. 2) The paper lacks a discussion on why more complex min-max OT solutions are not more robust for IRL, resulting in only marginal performance gains compared to t

Taxonomy

Topics: Reinforcement Learning in Robotics · Autonomous Vehicle Technology and Safety · Robot Manipulation and Learning