Transitive RL: Value Learning via Divide and Conquer

Seohong Park; Aditya Oberai; Pranav Atreya; Sergey Levine

arXiv:2510.22512·cs.LG·February 24, 2026


TL;DR

Transitive Reinforcement Learning (TRL) introduces a divide-and-conquer approach for value learning in offline goal-conditioned RL, reducing bias and variance issues and excelling in long-horizon tasks.

Contribution

The paper proposes TRL, a novel value learning algorithm leveraging triangle inequality structure for improved bias and variance management in offline GCRL.

Findings

01

TRL outperforms previous algorithms on long-horizon benchmarks.

02

TRL requires fewer recursive updates than TD methods.

03

Experimental results demonstrate superior performance in challenging tasks.

Abstract

In this work, we present Transitive Reinforcement Learning (TRL), a new value learning algorithm based on a divide-and-conquer paradigm. TRL is designed for offline goal-conditioned reinforcement learning (GCRL) problems, where the aim is to find a policy that can reach any state from any other state in the smallest number of steps. TRL converts a triangle inequality structure present in GCRL into a practical divide-and-conquer value update rule. This has several advantages compared to alternative value learning paradigms. Compared to temporal difference (TD) methods, TRL suffers less from bias accumulation, as in principle it only requires $O(\log T)$ recursions (as opposed to $O(T)$ in TD learning) to handle a length-$T$ trajectory. Unlike Monte Carlo methods, TRL suffers less from high variance as it performs dynamic programming. Experimentally, we show that TRL achieves the best…
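The $O(\log T)$ versus $O(T)$ claim can be illustrated with a small tabular sketch. This is not the paper's algorithm, which trains a learned value function from offline data. It is an exact shortest-path computation on a deterministic chain, used only to show why a triangle-inequality backup of the form d(s, g) ← min over w of d(s, w) + d(w, g) needs fewer sweeps than a one-step backup. All function names here (`chain_distances`, `td_style_sweep`, `transitive_sweep`, `sweeps_until_converged`) are hypothetical and chosen for illustration.

```python
import numpy as np


def chain_distances(num_states):
    """One-step distance table for a deterministic chain 0 -> 1 -> ... -> n-1."""
    d = np.full((num_states, num_states), np.inf)
    np.fill_diagonal(d, 0.0)
    for s in range(num_states - 1):
        d[s, s + 1] = 1.0
    return d


def td_style_sweep(d, one_step):
    """One-step backup: d(s, g) <- min over s' of one_step(s, s') + d(s', g)."""
    return np.min(one_step[:, :, None] + d[None, :, :], axis=1)


def transitive_sweep(d):
    """Divide-and-conquer backup: d(s, g) <- min over w of d(s, w) + d(w, g)."""
    return np.min(d[:, :, None] + d[None, :, :], axis=1)


def sweeps_until_converged(update, d0):
    """Count the sweeps that still change the table before a fixed point."""
    d, count = d0, 0
    while True:
        new_d = update(d)
        if np.array_equal(new_d, d):
            return count, d
        d, count = new_d, count + 1


# A 17-state chain, so the longest path has T = 16 steps.
one_step = chain_distances(17)
td_count, td_d = sweeps_until_converged(
    lambda d: td_style_sweep(d, one_step), one_step
)
tr_count, tr_d = sweeps_until_converged(transitive_sweep, one_step)
print(td_count, tr_count)
```

On this chain the one-step backup extends the known path length by one per sweep, while the transitive backup doubles it. Both reach the same distance table, but the number of sweeps grows linearly in T for the former and logarithmically for the latter. With function approximation, each sweep is a chance for error to compound, which is the intuition behind the bias-accumulation argument in the abstract.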

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 6·Confidence 4

Strengths

1. The paper is well written and easy to follow.
2. The proposed idea is novel and interesting. I agree with the authors that this paper is a first step towards a promising direction.

Weaknesses

1. The authors restrict the discussion to discrete, deterministic environments with trajectory data of equal lengths, although they claim that their proposal can be extended to continuous, stochastic environments and variable-length trajectories. I suggest the authors actually carry out these extensions and present the extended version.
2. The tasks used in the experiments are synthetic, without a clear real-life purpose.
3. Appendix A is unnecessary. It's just homework-level math.

Reviewer 02·Rating 6·Confidence 4

Strengths

1. TRL’s key strength lies in its scalability to long-horizon tasks (up to 4000 steps), where it consistently outperforms or matches the best TD- and MC-based baselines. By reducing the Bellman recursion depth to logarithmic complexity, TRL fundamentally mitigates the bias accumulation problem that plagues TD methods over long trajectories.
2. In contrast to TD-$n$ approaches, TRL attains superior performance without the need for laborious, task-specific tuning of the horizon parameter $n$, offering

Weaknesses

1. The proposed approach fundamentally depends on deterministic environment dynamics for the triangle inequality assumption to hold. Extending the framework to learn unbiased value functions in stochastic environments remains an important and open avenue for future research.
2. As shown in the ablation study, the results are highly sensitive to subgoal selection, which may introduce additional instability when applied to stochastic or noisy environments.

Reviewer 03·Rating 4·Confidence 3

Strengths

The idea seems sensible and the results seem strong. It seems to make the idea of triangle inequality for value learning work, although I am not very familiar with current GCRL literature.

Weaknesses

**W1.** Complicated algorithm with lots of moving parts: eta-quantile, M subgoals, expectile loss, reweighting, separate "oracle distillation" network, policy extraction. This is many more "moving parts" than standard algorithms like IQL or TD-n, which could make it difficult to tune and implement.
**W2.** As far as I understand, it relies on oracle representations (see Appendix D.2). This seems like a significant weakness.
**W3.** Structure: The triangle inequality is referred to in the abstr
