TL;DR
This paper systematically studies key factors affecting batch online reinforcement learning in robotics, demonstrating that Q-functions, expressive policies, and diverse action sampling significantly improve performance and scalability.
Contribution
It provides a comprehensive empirical analysis identifying crucial elements for effective batch online RL and proposes a practical recipe for improved performance in robotic applications.
Findings
Q-functions significantly improve batch online RL performance
Implicit policy extraction outperforms traditional offline RL methods
Expressive policies and diverse action sampling enhance scalability
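As a rough illustration of how a Q-function and diverse action sampling can combine at rollout time, here is a minimal sketch of implicit policy extraction: draw several candidate actions from a policy and execute the one the Q-function scores highest. This is not taken from the paper. The names `select_action`, `toy_policy`, and `toy_q` are hypothetical, and the toy Gaussian policy and quadratic Q-function stand in for whatever learned models a real system would use.

```python
import numpy as np


def select_action(sample_actions, q_function, state, num_samples=16, rng=None):
    """Pick the candidate action with the highest Q-value (hypothetical helper).

    sample_actions(state, n, rng) -> array of shape (n, action_dim)
    q_function(state, actions)    -> array of shape (n,)
    """
    rng = np.random.default_rng() if rng is None else rng
    candidates = sample_actions(state, num_samples, rng)
    q_values = q_function(state, candidates)
    best = int(np.argmax(q_values))
    return candidates[best], float(q_values[best])


# Toy stand-ins (assumptions for illustration only).
def toy_policy(state, n, rng):
    # A noisy policy: candidates are scattered around the current state.
    return state + rng.normal(scale=0.5, size=(n, state.shape[0]))


def toy_q(state, actions):
    # Higher value for actions closer to a fixed target of all ones.
    target = np.ones_like(state)
    return -np.sum((actions - target) ** 2, axis=1)


state = np.zeros(2)
action, q_value = select_action(
    toy_policy, toy_q, state, num_samples=16, rng=np.random.default_rng(0)
)
```

In this sketch the policy network is never updated toward the Q-function; the Q-function only reranks sampled candidates, so a more diverse sampler gives it more to choose from.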
Abstract
The ability to learn from large batches of autonomously collected data for policy improvement -- a paradigm we refer to as batch online reinforcement learning -- holds the promise of enabling truly scalable robot learning by significantly reducing the need for human effort of data collection while getting benefits from self-improvement. Yet, despite the promise of this paradigm, it remains challenging to achieve due to algorithms not being able to learn effectively from the autonomous data. For example, prior works have applied imitation learning and filtered imitation learning methods to the batch online RL problem, but these algorithms often fail to efficiently improve from the autonomously collected data or converge quickly to a suboptimal point. This raises the question of what matters for effective batch online RL in robotics. Motivated by this question, we perform a systematic…
Peer Reviews
Decision: ICLR 2026 Poster
The topic---batch online RL---is relevant to current applications of RL. The paper is well-written. The use of color for denoting the sections is a nice touch. This paper does a good job at exploring some important algorithmic considerations for the batch online RL setting and the findings are presented clearly. Some findings are quite interesting, such as value-based methods outperforming the filtered IL after several iterations of batch training but not initially.
My main concern is the lack of certain baselines in the experimental sections.
- What are the previous baseline methods for the benchmarks considered? While I see multiple comparisons in each section to examine the impact of various choices, I do not see any explicit comparison to prior methods. While achieving the best performance is not strictly necessary, it would be helpful to have some baseline numbers to get a sense of how well the evaluated algorithm is doing.
- In Fig.7, it seems like
* The paper addresses a practical and important problem. The "batch online RL" setting, which involves collecting large batches of data for offline updates, is a sensible and scalable approach for real-world robotics, reducing the need for constant human supervision and avoiding the instability of purely online updates.
* The work is structured as a controlled study that ablates different components of the learning pipeline (algorithm, extraction, expressivity). This systematic approach help
* A major weakness is the lack of motivation for choosing the three specific axes of analysis. The paper repeatedly refers to these axes but never explains why these are the most critical or representative components to study. The selection of only three algorithm classes (IL, filtered-IL, and value-based RL) also feels restrictive and lacks justification, especially when hybrid methods exist.
* Several of the key findings seem intuitive or are well-established principles in reinforcement le
- The paper is well-structured and easy to read.
- The problem setting, batch online RL, is practical and promising. And it makes sense to decompose the problem into the proposed 3 components.
__There are missing experimental details I would like to verify further:__
- For experiments in Section 4.1, my understanding is: all three methods train a policy with the DDPM objective. The value-based RL one trains an additional Q function guiding policy rollout only. If this is the case, why does the initial performance of value-based RL differ from the IL baseline when both of them learn from $\mathcal{D}_0$?
- In Section 4.2, what is the training objective when the authors apply AWR to a
