Contrastive Difference Predictive Coding
Chongyi Zheng, Ruslan Salakhutdinov, Benjamin Eysenbach

TL;DR
This paper introduces a temporal difference contrastive predictive coding method that improves data efficiency in learning long-term dependencies for time-series prediction and goal-conditioned reinforcement learning.
Contribution
It presents a novel temporal difference approach to contrastive predictive coding, reducing data requirements for learning long-term dependencies in time series.
Findings
Achieves 2x median success rate improvement in goal-conditioned RL
Outperforms prior methods in stochastic environments
Significantly more sample efficient in tabular settings
Abstract
Predicting and reasoning about the future lie at the heart of many time-series questions. For example, goal-conditioned reinforcement learning can be viewed as learning representations to predict which states are likely to be visited in the future. While prior methods have used contrastive predictive coding to model time series data, learning representations that encode long-term dependencies usually requires large amounts of data. In this paper, we introduce a temporal difference version of contrastive predictive coding that stitches together pieces of different time series data to decrease the amount of data required to learn predictions of future events. We apply this representation learning method to derive an off-policy algorithm for goal-conditioned RL. Experiments demonstrate that, compared with prior RL methods, ours achieves a 2× median improvement in success rates and…
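For orientation, the standard (Monte Carlo) contrastive predictive coding objective that the abstract contrasts against can be sketched as below. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names `infonce_loss` and `sample_future_index`, the inner-product critic, and the geometric sampling of future offsets are illustrative choices. The temporal difference variant proposed in the paper is not reproduced here.

```python
import numpy as np


def infonce_loss(anchor_repr, future_repr):
    """Monte Carlo InfoNCE loss over a batch (illustrative sketch).

    Row i of `anchor_repr` is paired with row i of `future_repr` as the
    positive; every other row of `future_repr` serves as a negative.
    The inner-product critic is an assumption made for this example.
    """
    logits = anchor_repr @ future_repr.T                  # (B, B) critic scores
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                   # cross-entropy on the diagonal


def sample_future_index(t, horizon, gamma, rng):
    """Pick a future time step from the same trajectory (hypothetical helper).

    The offset is geometric with success probability 1 - gamma, so it is
    always at least 1; the index is clipped to the last step of the trajectory.
    """
    offset = int(rng.geometric(1.0 - gamma))
    return min(t + offset, horizon - 1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, dim, horizon = 8, 16, 50
    anchors = rng.normal(size=(batch, dim))   # stand-in for encoded current states
    futures = rng.normal(size=(batch, dim))   # stand-in for encoded future states
    print("loss:", infonce_loss(anchors, futures))
    print("future index:", sample_future_index(10, horizon, 0.9, rng))
```

Because the positives here must come from the same trajectory as their anchors, this estimator cannot combine pieces of different trajectories, which is the limitation the abstract says the temporal difference version addresses.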
Peer Reviews
Decision: ICLR 2024 poster
- The paper proposes a new temporal difference (TD) estimator for the InfoNCE loss, which is shown to be more efficient than the standard (Monte Carlo) estimator.
- The proposed goal-conditioned reinforcement learning (RL) algorithm outperforms prior methods in both online and offline settings.
- The proposed algorithm is capable of handling stochasticity in the environment dynamics.
- In stochastic tasks, there is an excellent improvement in performance versus the baseline of Quasimetric RL,
- The paper focuses on fairly trivial environments. It would be nice to see these methods working on more challenging, higher-dimensional goal-conditioned RL tasks, as it's not a given that these gains will carry over to tasks that matter a lot more.
- The proposed TD estimator is more complex than the standard (Monte Carlo) estimator, and its implementation requires more hyperparameters.
- The performance of the proposed goal-conditioned RL algorithm on the most challenging tasks was less than 5
The derived method fits nicely within the literature and seems to fill a nice gap between contrastive objectives from self-supervised learning and more online-focused temporal-difference updates.
After the rebuttal, I'm raising my score. While I believe there are still issues with the empirical section, these are issues the rest of the literature is also facing. I don't think rejecting this paper is a way to a solution. I also appreciate the fix to some of the inaccurate statements that were overlooked! Great job, authors!
--------before edit-------
This paper struggles with clarity and accuracy in some of the ancillary statements made about the literature surrounding the paper and in th
The paper is mostly well written, apart from some details (see questions section). The derivations are sound. Experimental results show strong performance compared to previous methods. The paper also presents some analysis and insights to explain the performance.
The novelty is slightly limited. The idea of using InfoNCE to estimate the state occupancy measure has been presented in contrastive RL; the Bellman-like update and the use of importance weights have been presented in C-Learning.
Taxonomy
Topics: Data Stream Mining Techniques · Mental Health Research Topics · Reinforcement Learning in Robotics
Methods: InfoNCE · Contrastive Predictive Coding
