TL;DR
This paper empirically investigates deep reinforcement learning algorithms in continuing tasks, highlighting their behaviors, challenges, and the effectiveness of reward-centering methods across various algorithms and large-scale environments.
Contribution
It provides the first comprehensive empirical analysis of deep RL in continuing tasks and extends reward-centering techniques to multiple algorithms and larger environments.
Findings
Reward-centering improves performance across algorithms.
Deep RL algorithms behave differently in continuing tasks.
Reward-centering outperforms other methods in large-scale environments.
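For readers unfamiliar with the term, reward centering can be illustrated with a minimal sketch. It assumes the simplest variant, in which a running estimate of the average reward is subtracted from each observed reward before that reward is used in a learning update. The function name `center_rewards` and the `step_size` parameter are hypothetical and chosen for illustration only; this is not the paper's implementation, which may use a different estimator.

```python
def center_rewards(rewards, step_size=0.01):
    """Subtract a running average-reward estimate from each reward.

    Illustrative only: `center_rewards` and `step_size` are hypothetical
    names, and the running-average estimator is an assumption.
    """
    avg_reward = 0.0  # running estimate of the average reward
    centered = []
    for r in rewards:
        # Center the reward with the estimate available before this step.
        centered.append(r - avg_reward)
        # Move the estimate toward the newly observed reward.
        avg_reward += step_size * (r - avg_reward)
    return centered, avg_reward


if __name__ == "__main__":
    centered, estimate = center_rewards([5.0, 5.0, 5.0], step_size=0.5)
    print(centered, estimate)
```

With a constant reward stream, the centered rewards shrink toward zero as the estimate approaches the true average, which is the intuition behind removing a large constant offset from the learning signal.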
Abstract
In reinforcement learning (RL), continuing tasks refer to tasks where the agent-environment interaction is ongoing and cannot be broken down into episodes. These tasks are suitable when environment resets are unavailable, agent-controlled, or predefined but where all rewards, including those beyond resets, are critical. These scenarios frequently occur in real-world applications and cannot be modeled by episodic tasks. While modern deep RL algorithms have been extensively studied and well understood in episodic tasks, their behavior in continuing tasks remains underexplored. To address this gap, we provide an empirical study of several well-known deep RL algorithms using a suite of continuing task testbeds based on MuJoCo and Atari environments, highlighting several key insights concerning continuing tasks. Using these testbeds, we also investigate the effectiveness of a method for…
Peer Reviews
Decision·ICLR 2025 Conference Withdrawn Submission
- The experimental setup is, to the best of my knowledge, novel and interesting. RL in the continuing task setting is an important sub-area, with many practical applications.
- The empirical study in Section 2 is extensive; they implement all of the canonical deep RL algorithms, and perform enough runs (10 per agent/environment) to establish statistical significance.
- The core methodological proposals (Section 3) represent what is, in my opinion, an incremental amendment to Naik et al.'s reward centering work. The authors show that Naik et al.'s proposals can be repurposed for the deep RL setting, but this is a logical conclusion one could draw from Naik et al.'s paper alone, and does not represent the kind of methodological novelty usually expected from a paper at ICLR.
- The paper's layout feels unintuitive. Though the authors discuss some related work in
The experimentation is well done. There is a good analysis under the reward-rate metric. Diving into WHY this happens and then drawing conclusions for Swimmer and this leading to dramatic improvements is very impressive.
The whole premise of the work rests on the connection to real-world tasks, primarily robotics. When simulation is not possible (for example, in very complex scenarios), it is indeed of interest to deploy the robot in the real world and have it continually learn. However, I feel that theory and practice are somewhat mixed, specifically around which metrics should be measured. When deploying a robot, I don't see the reward rate being a metric of interest. There's a task, we can measure h
- To my knowledge, learning in continuing tasks is of great practical significance while remaining under-explored in the RL community. This work proposes a promising standard benchmark for this direction.
- For the experiments, three types of testbeds are proposed and popular DRL agents are used as baselines. A visual analysis of the failure patterns is provided in Figure 2.
- The writing is unsatisfactory; in particular, I believe there is considerable room to improve the organization of the content. Since this work mainly presents an empirical study, the purposes and structure of the experiment designs and the key findings of the experimental results are of the most importance to the audience. Here are a few suggestions:
  - The introduction section is a bit too long. It is now over 2 pages, which makes it difficult to convey the main thread of thoughts and findings in
