Episodic Novelty Through Temporal Distance

Yuhua Jiang; Qihan Liu; Yiqin Yang; Xiaoteng Ma; Dianyu Zhong; Hao Hu; Jun Yang; Bin Liang; Bo Xu; Chongjie Zhang; Qianchuan Zhao

arXiv:2501.15418·cs.LG·January 28, 2025


Open Access 3 Reviews

TL;DR

This paper introduces ETD, a new method using temporal distance and contrastive learning to improve exploration in sparse reward environments with varying contexts, outperforming existing approaches.

Contribution

The paper presents a novel temporal distance metric and contrastive learning framework for intrinsic motivation in CMDPs, addressing limitations of count-based and similarity-based methods.

Findings

01

ETD outperforms state-of-the-art exploration methods on benchmark tasks.

02

Temporal distance effectively captures state novelty in sparse reward environments.

03

Contrastive learning enhances the accuracy of temporal distance estimation.

Abstract

Exploration in sparse reward environments remains a significant challenge in reinforcement learning, particularly in Contextual Markov Decision Processes (CMDPs), where environments differ across episodes. Existing episodic intrinsic motivation methods for CMDPs primarily rely on count-based approaches, which are ineffective in large state spaces, or on similarity-based methods that lack appropriate metrics for state comparison. To address these shortcomings, we propose Episodic Novelty Through Temporal Distance (ETD), a novel approach that introduces temporal distance as a robust metric for state similarity and intrinsic reward computation. By employing contrastive learning, ETD accurately estimates temporal distances and derives intrinsic rewards based on the novelty of states within the current episode. Extensive experiments on various benchmark tasks demonstrate that ETD…
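The idea of an episodic novelty bonus can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes the bonus for a new state is the smallest distance from any state already visited in the current episode, which is one plausible reading of the abstract. `toy_distance` is a hypothetical Euclidean stand-in for the learned temporal distance, and all function names are invented for this example. The combination `r = r_e + beta * b` follows the decomposition quoted in the reviews.

```python
import numpy as np


def toy_distance(x, y):
    """Hypothetical stand-in for a learned temporal distance (plain Euclidean here)."""
    return float(np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(y, dtype=float)))


def episodic_bonus(memory, new_state, distance=toy_distance):
    """Assumed bonus: smallest distance from any state visited this episode to new_state."""
    if not memory:
        # Assumption: no bonus when there is nothing to compare against.
        return 0.0
    return min(distance(s, new_state) for s in memory)


def total_reward(extrinsic, bonus, beta=0.1):
    """Total reward = environment reward + beta * intrinsic bonus."""
    return extrinsic + beta * bonus


def run_episode(states, beta=0.1):
    """Walk through a fixed state sequence with zero extrinsic reward (sparse setting)."""
    memory = [states[0]]  # episodic memory is reset at the start of every episode
    rewards = []
    for s_next in states[1:]:
        b = episodic_bonus(memory, s_next)
        rewards.append(total_reward(0.0, b, beta))
        memory.append(s_next)
    return rewards


if __name__ == "__main__":
    # Revisiting a state already seen in this episode earns no bonus.
    print(run_episode([(0, 0), (3, 4), (0, 0), (3, 4)], beta=1.0))
```

In this sketch the memory is rebuilt for each episode, so novelty is judged only against the current episode's states, which matters when the environment changes between episodes.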

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1. The paper introduces a novel method to use temporal distance as a (quasi)metric for state similarity.
2. The paper conducts extensive experiments across multiple CMDP environments, comparing ETD to several baseline methods.
3. ETD + PPO demonstrates robust performance improvements, especially in challenging sparse reward scenarios.
4. Results of extensive experiments across multiple CMDP environments, comparing ETD to several baseline methods, are reported.
5. The paper is well-structure

Weaknesses

1. The proposed ETD doesn’t take extrinsic rewards into consideration when computing similarity. Intuitively, states with similar rewards could be considered similar in terms of the task objective [1].
2. The approach has been primarily tested on discrete action spaces, and its effectiveness in continuous action domains such as MuJoCo [2], DeepMind Control Suite [3], or Fetch [4] environments remains unexplored.

[1] Agarwal, Rishabh, et al. “Contrastive behavioral similarity embeddings for general

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- The paper is well-written, with a clear and cohesive narrative. Most technical details are effectively conveyed through illustrative figures and results from intuitive toy tasks.
- The experimental tasks are appropriately chosen, providing sufficient complexity to evaluate the approach.
- The experimental results, along with ablation studies, clearly demonstrate the advantages of the proposed method over other baselines.
- The paper offers a comprehensive analysis and comparison of different t

Weaknesses

## MDP assumption

The definition of the intrinsic bonus reward violates the MDP assumption. In section 2, the total reward $r(s_t, a_t, s_{t+1})$ is decomposed as the environment reward $r_t^e$ plus the weighted bonus $\beta b_t$. However, $b_t$ is a function that depends on the visited states within the episode, which disrupts the definition of total reward and violates the MDP assumption. In this case, the visited states influence the intrinsic reward, potentially harming the policy built u

Reviewer 03 · Rating 8 · Confidence 4

Strengths

- A well-motivated new intrinsic reward signal for RL
- The concept is simple and provides strong performance
- Exhaustive evaluations, good comparisons with the baseline methods, and nice ablations
- Good insights into why baseline methods fail

Weaknesses

The paper is already of high quality and I could not find major weaknesses. Minor weaknesses are given in the questions section.


Taxonomy

Topics: Computational and Text Analysis Methods