TOP-ERL: Transformer-based Off-Policy Episodic Reinforcement Learning

Ge Li; Dong Tian; Hongyi Zhou; Xinkai Jiang; Rudolf Lioutikov; Gerhard Neumann

arXiv:2410.09536·cs.LG·March 18, 2025

TL;DR

TOP-ERL introduces a transformer-based off-policy reinforcement learning algorithm that improves sample efficiency and stability in long-horizon, trajectory-based policy learning, especially in robot environments.

Contribution

It pioneers the integration of transformer architectures into off-policy ERL, enabling effective evaluation of entire action sequences for improved learning.

Findings

01. Outperforms state-of-the-art RL methods in robot environments
02. Demonstrates stable and efficient training with long action trajectories
03. Ablation studies highlight the importance of key design choices

Abstract

This work introduces Transformer-based Off-Policy Episodic Reinforcement Learning (TOP-ERL), a novel algorithm that enables off-policy updates in the ERL framework. In ERL, policies predict entire action trajectories over multiple time steps instead of single actions at every time step. These trajectories are typically parameterized by trajectory generators such as Movement Primitives (MP), allowing for smooth and efficient exploration over long horizons while capturing high-level temporal correlations. However, ERL methods are often constrained to on-policy frameworks due to the difficulty of evaluating state-action values for entire action sequences, limiting their sample efficiency and preventing the use of more efficient off-policy architectures. TOP-ERL addresses this shortcoming by segmenting long action sequences and estimating the state-action values for each segment using a…
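The two ideas named in the abstract, a trajectory generator that maps a few parameters to a long action sequence and the splitting of that sequence into segments, can be illustrated with a minimal sketch. It is not the authors' implementation: the normalized Gaussian basis, the function names `rbf_basis`, `generate_trajectory`, and `split_into_segments`, and all numeric values are assumptions chosen for illustration only.

```python
import math

def rbf_basis(t, centers, width):
    """Normalized Gaussian basis values at phase t in [0, 1] (assumed basis)."""
    raw = [math.exp(-((t - c) ** 2) / (2.0 * width ** 2)) for c in centers]
    total = sum(raw)
    return [r / total for r in raw]

def generate_trajectory(weights, num_steps, width=0.15):
    """Map a small weight vector to a smooth 1-D trajectory of num_steps points.

    Requires at least two weights and at least two steps.
    """
    n = len(weights)
    centers = [i / (n - 1) for i in range(n)]  # evenly spaced basis centers
    trajectory = []
    for k in range(num_steps):
        t = k / (num_steps - 1)  # phase variable running from 0 to 1
        phi = rbf_basis(t, centers, width)
        trajectory.append(sum(w * p for w, p in zip(weights, phi)))
    return trajectory

def split_into_segments(sequence, segment_length):
    """Cut a sequence into consecutive chunks; the last chunk may be shorter."""
    return [sequence[i:i + segment_length]
            for i in range(0, len(sequence), segment_length)]

# Hypothetical example: 4 parameters describe a 20-step action trajectory.
weights = [0.0, 1.0, -0.5, 0.3]
trajectory = generate_trajectory(weights, num_steps=20)
segments = split_into_segments(trajectory, segment_length=5)
print(len(trajectory), [len(s) for s in segments])
```

In this sketch the policy would only need to output the four weights, while a critic could be asked to score each five-step segment separately instead of the whole 20-step sequence at once.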

Peer Reviews

Decision·ICLR 2025 Spotlight

Reviewer 01 · Rating 6 · Confidence 3

Strengths

This paper is easy to follow. Given that research on episodic RL is relatively limited, TOP-ERL shows its strengths in sample efficiency and overall performance.

Weaknesses

1. The code repository is incomplete, lacking the environment configuration files and key training files. Additionally, several important functions, such as ValueFunction, have not been implemented.
2. Figure 4(a) only presents the average results for the 50 tasks, without showing the individual results for each task. While I understand that the main text has space constraints, I strongly recommend including the results for all 50 tasks in the appendix.
3. The methodology seems to lack novelty

Reviewer 02 · Rating 8 · Confidence 3

Strengths

1. A critic network for action sequences using a traditional Transformer architecture is proposed to model N-step returns.
2. A SAC-like algorithm is adapted for off-policy episodic reinforcement learning using the trained critic network.
3. The proposed methodology is shown to improve sampling efficiency and also is shown to stabilize training over other baseline methods.
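For readers unfamiliar with the N-step returns mentioned in the first point, the standard textbook definition is the discounted sum of N rewards plus a discounted bootstrap value at the state reached after N steps. The sketch below computes that quantity; it is the generic definition, not necessarily the exact target used in the paper, and the function name `n_step_return` and the example numbers are assumptions.

```python
def n_step_return(rewards, gamma, bootstrap_value):
    """Discounted N-step return: sum_k gamma^k * r_k + gamma^N * V(s_N).

    rewards         -- the N rewards collected along the segment
    gamma           -- discount factor in [0, 1]
    bootstrap_value -- value estimate of the state reached after N steps
    """
    g = bootstrap_value
    # Fold from the last reward backwards so each step applies one discount.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Hypothetical numbers: three rewards of 1, gamma 0.5, bootstrap value 4.
example = n_step_return([1.0, 1.0, 1.0], gamma=0.5, bootstrap_value=4.0)
print(example)  # 1 + 0.5 + 0.25 + 0.125 * 4 = 2.25
```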

Weaknesses

1. In the current writing, it is hard to identify the novel parts that are proposed by the author and parts that are kept from other works.
2. Section 4.4 seems very important to understand how to adapt SAC to episodic reinforcement learning but it seems not clear.
3. Given the resemblance of the critic training to the prediction of rewards that can be used for more effective credit assignment compared to dense and sparse rewards, additional ablation studies or baselines could be discussed.

Reviewer 03 · Rating 8 · Confidence 4

Strengths

I find the paper very well written, well structured, motivated, and explained. The idea of using a transformer such that s_0 maps to V(s), a_0 maps to Q(s, a_0), a_1 maps to Q(s, a_1), ... is very natural and creative. To the best of my knowledge, it is the first approach that proposes to do it this way. I also find it interesting that V and Q can be modeled by the same transformer architecture without having dedicated weights just for V or just for Q. The contributions of this pape
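The token layout the reviewer describes, one state token followed by action tokens with one output per token, can be pictured with a toy sketch. It assumes a causal mask, so that each output sees only the state and the actions up to its own position; the source text does not confirm that detail. A uniform average stands in for real attention, and the names `causal_mask` and `causal_prefix_mean` as well as the scalar tokens are hypothetical.

```python
def causal_mask(num_tokens):
    """mask[i][j] is True when output position i may look at input position j."""
    return [[j <= i for j in range(num_tokens)] for i in range(num_tokens)]

def causal_prefix_mean(tokens):
    """Stand-in for masked attention: uniform average over the visible tokens."""
    mask = causal_mask(len(tokens))
    outputs = []
    for i in range(len(tokens)):
        visible = [tok for j, tok in enumerate(tokens) if mask[i][j]]
        outputs.append(sum(visible) / len(visible))
    return outputs

# Scalar placeholders for the input tokens [s_0, a_0, a_1, a_2].
tokens = [1.0, 2.0, 3.0, 4.0]
outputs = causal_prefix_mean(tokens)
# Position 0 plays the role of the V output, positions 1.. the role of Q outputs.
print(outputs)

# Changing the last action leaves every earlier output untouched.
changed = causal_prefix_mean([1.0, 2.0, 3.0, 99.0])
print(changed[:3] == outputs[:3])
```

The point of the sketch is only the dependency structure: a single shared network produces all outputs, and the first output depends on the state alone.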

Weaknesses

The authors claim that off-policy algorithms are often more sample efficient than their on-policy counterparts. However, PPO is still often the way to go when you face a new RL problem because of its better stability and the need for less hyperparameter tuning. I am not fully convinced this approach won't have the same issue. I believe the explanations of why importance sampling is not necessary could be improved.

Code & Models

Repositories

brucegeli/top_erl_iclr25
none · Official

Taxonomy

Topics · Reinforcement Learning in Robotics