Stabilizing Contrastive RL: Techniques for Robotic Goal Reaching from Offline Data
Chongyi Zheng, Benjamin Eysenbach, Homer Walke, Patrick Yin, Kuan Fang, Ruslan Salakhutdinov, Sergey Levine

TL;DR
This paper demonstrates that contrastive self-supervised reinforcement learning can be effectively applied to real-world robotic goal-reaching tasks using only a single goal image, significantly reducing the need for manual reward engineering.
Contribution
It introduces a practical contrastive RL method for robotic control that works from offline data and single goal images, advancing self-supervised learning in robotics.
Findings
Hyperparameter tuning doubled success rates in simulation
Contrastive RL achieved real-world robotic goal reaching
Method reduces reliance on manual reward engineering
Abstract
Robotic systems that rely primarily on self-supervised learning have the potential to decrease the amount of human annotation and engineering effort required to learn control strategies. In the same way that prior robotic systems have leveraged self-supervised techniques from computer vision (CV) and natural language processing (NLP), our work builds on prior work showing that reinforcement learning (RL) itself can be cast as a self-supervised problem: learning to reach any goal without human-specified rewards or labels. Despite the seeming appeal, little (if any) prior work has demonstrated how self-supervised RL methods can be practically deployed on robotic systems. By first studying a challenging simulated version of this task, we discover design decisions about architectures and hyperparameters that increase the success rate by 2×. These findings lay the groundwork for…
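To make the idea of casting goal reaching as a self-supervised problem concrete, here is a minimal NumPy sketch of a contrastive critic loss. It is not the authors' implementation: the function names, the array shapes, and the choice of a binary noise-contrastive (sigmoid cross-entropy) objective are assumptions made for illustration. Each (state, action) embedding is paired with the embedding of a goal taken from the future of the same trajectory (the positive), and with the goals of the other batch elements (the negatives), so no reward labels are involved.

```python
import numpy as np


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)


def contrastive_critic_loss(sa_repr, goal_repr):
    """Binary NCE loss over a batch (illustrative, hypothetical helper).

    sa_repr:   (B, d) embeddings of (state, action) pairs.
    goal_repr: (B, d) embeddings of goals; row i is assumed to come from
               the future of the same trajectory as sa_repr[i].
    """
    # Inner-product critic: logits[i, j] scores pair i against goal j.
    logits = sa_repr @ goal_repr.T
    # Diagonal entries are positives, off-diagonal entries are negatives.
    labels = np.eye(logits.shape[0])
    per_pair = -(labels * log_sigmoid(logits)
                 + (1.0 - labels) * log_sigmoid(-logits))
    return per_pair.mean()


# Toy check: correctly paired embeddings should score a lower loss
# than the same embeddings with the goals deliberately mismatched.
rng = np.random.default_rng(0)
sa = rng.normal(size=(8, 16))
aligned_loss = contrastive_critic_loss(sa, sa)
shuffled_loss = contrastive_critic_loss(sa, sa[::-1])
```

In an actual agent, the two embeddings would come from learned encoders over images and actions, and the policy would be trained to pick actions whose embedding scores highly against the commanded goal image.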
Peer Reviews
Decision: ICLR 2024 spotlight
The proposed method is effective and cleanly applies various improvements from the literature to contrastive RL to improve its performance. The goal of improving real-world performance is supported by real-world experiments. Multiple additional experiments provide further insight, and the reviewer found them to be useful additions. The appendix contains many useful details and experiments.
The conclusion that a deeper CNN performs worse than a shallow one, likely because of overfitting, suggests that the benchmark tasks may not align with the paper's goal of leveraging a vast amount of unlabeled data, as current approaches in computer vision and NLP do. Some RL references for the design decisions are missing. For instance, McCandlish et al. and Bjorck et al. showed that large batch training and layer normalization, respectively, are effective in RL. A
1. The paper clearly describes several tricks that greatly increase the success rate of contrastive RL agents. In particular, it is interesting that the final layer initialization trick does better than learning rate warm-up.
2. The experiments include a large number of relevant baselines, including baselines that are pre-trained on large video datasets.
3. The authors identify the “arm matching problem”, where the value function cares about the state of the robot arm but not the environment.
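The reviews name two of the stabilizing tricks, final layer initialization and layer normalization, without showing them. The sketch below illustrates one plausible reading in NumPy, under stated assumptions: the final-layer weights are drawn from a very narrow uniform range so that initial representations start near zero, and layer normalization is applied before the nonlinearity. The helper names, the layer sizes, the `final_scale` value, and the placement of the normalization are all hypothetical and are not taken from the paper's code.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    # Normalize each row to zero mean and unit variance over features.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def init_encoder(rng, in_dim, hidden_dim, out_dim, final_scale=1e-12):
    # Hidden layer: standard He-style Gaussian initialization.
    w1 = rng.normal(scale=np.sqrt(2.0 / in_dim), size=(in_dim, hidden_dim))
    # Final layer: tiny uniform weights (final_scale is an assumed value).
    w2 = rng.uniform(-final_scale, final_scale, size=(hidden_dim, out_dim))
    return {"w1": w1, "b1": np.zeros(hidden_dim),
            "w2": w2, "b2": np.zeros(out_dim)}


def encode(params, x):
    # Linear -> layer norm -> ReLU -> small-init linear output layer.
    h = layer_norm(x @ params["w1"] + params["b1"])
    h = np.maximum(h, 0.0)
    return h @ params["w2"] + params["b2"]


rng = np.random.default_rng(0)
params = init_encoder(rng, in_dim=32, hidden_dim=64, out_dim=16)
reprs = encode(params, rng.normal(size=(4, 32)))
max_abs = np.abs(reprs).max()  # representations start very close to zero
```

With near-zero initial representations, an inner-product critic starts with logits near zero for every pair, which is one intuition for why such an initialization might play a role similar to learning rate warm-up early in training.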
1. The paper demonstrates that contrastive RL is a brittle objective, since small changes in the network architecture and initialization lead to huge changes in agent performance. The paper does not study if the objective could be changed to make it more robust and less prone to overfitting.
2. It is unclear why none of the methods can learn, e.g., the “push can” policy in Figure 5. Overall, the paper does not do a good job of explaining why some of the seemingly simple tasks are difficult and
The authors use clear descriptions throughout the paper. The background, related work, and problem definition are all clear and sufficiently detailed. Previous works used heuristically designed visual representations without detailed design studies or experiments on those representations. The authors make a strong contribution to the field by carefully designing the architectures and algorithms and evaluating them with extensive experiments. Those experiments support their claim
The authors ran strong experiments on a variety of tasks, with intensive analysis from multiple viewpoints. More analytical descriptions in the performance comparisons would contribute more to the RL field. For example, in the simulated manipulation analysis, worse performance is explained only with a single speculative clause: "perhaps because the block in that task occludes the drawer handle and introduces partial observability." Also, while pages are limited, some analysis of learned represen
Taxonomy
Topics: Machine Learning and Data Classification · Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Learning
