Which Experiences Are Influential for RL Agents? Efficiently Estimating The Influence of Experiences

Takuya Hiraoka; Guanquan Wang; Takashi Onishi; Yoshimasa Tsuruoka

arXiv:2405.14629·cs.LG·July 22, 2025

TL;DR

This paper introduces PIToD, an efficient method for estimating the influence of experiences in reinforcement learning, enabling the removal of negatively influential experiences to improve agent performance.

Contribution

We propose PIToD, a novel, computationally efficient approach for influence estimation in RL, facilitating experience-based performance enhancement.

Findings

1. PIToD accurately estimates experience influence compared to LOO.
2. Using PIToD to remove negatively influential experiences improves RL performance.
3. PIToD is more computationally efficient than traditional influence estimation methods.

Abstract

In reinforcement learning (RL) with experience replay, experiences stored in a replay buffer influence the RL agent's performance. Information about how these experiences influence the agent's performance is valuable for various purposes, such as identifying experiences that negatively influence underperforming agents. One method for estimating the influence of experiences is the leave-one-out (LOO) method. However, this method is usually computationally prohibitive. In this paper, we present Policy Iteration with Turn-over Dropout (PIToD), which efficiently estimates the influence of experiences. We evaluate how correctly PIToD estimates the influence of experiences and its efficiency compared to LOO. We then apply PIToD to amend underperforming RL agents, i.e., we use PIToD to estimate negatively influential experiences for the RL agents and to delete the influence of these…
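To make the idea behind turn-over dropout concrete, here is a minimal sketch using the mask notation $m_i$ / $w_i$ that appears in the reviews. It is not the authors' implementation: it assumes a toy linear regressor instead of the Q-function and policy networks, and the helpers `tod_mask`, `predict`, `train`, and `self_influence` are hypothetical names introduced only for illustration. Each experience $e_i$ gets a fixed binary mask $m_i$; training on $e_i$ updates only the parameters kept by $m_i$, so the complement $w_i = 1 - m_i$ selects parameters that $e_i$ never touched. Comparing the loss on $e_i$ under $w_i$ and under $m_i$ then gives an influence estimate from a single training run, whereas LOO would retrain once per removed experience.

```python
import numpy as np


def tod_mask(index: int, size: int, seed: int = 0) -> np.ndarray:
    """Deterministic binary mask m_i for experience `index` (hypothetical helper)."""
    rng = np.random.default_rng([seed, index])
    return rng.integers(0, 2, size=size).astype(float)


def predict(theta: np.ndarray, x: np.ndarray, mask: np.ndarray) -> float:
    """Toy linear model in which only the parameters kept by `mask` are active."""
    return float((mask * theta) @ x)


def train(xs: np.ndarray, ys: np.ndarray, size: int,
          epochs: int = 200, lr: float = 0.05) -> np.ndarray:
    """SGD where sample i is always trained under its own mask m_i."""
    theta = np.zeros(size)
    for _ in range(epochs):
        for i, (x, y) in enumerate(zip(xs, ys)):
            m = tod_mask(i, size)
            err = predict(theta, x, m) - y
            # The gradient is zero wherever m == 0, so those parameters
            # are never updated by sample i.
            theta -= lr * err * m * x
    return theta


def self_influence(theta: np.ndarray, xs: np.ndarray, ys: np.ndarray, i: int) -> float:
    """Loss on sample i without its influence (w_i) minus loss with it (m_i)."""
    m = tod_mask(i, len(theta))
    w = 1.0 - m  # complement mask: parameters never updated by sample i
    loss_w = (predict(theta, xs[i], w) - ys[i]) ** 2
    loss_m = (predict(theta, xs[i], m) - ys[i]) ** 2
    return loss_w - loss_m


if __name__ == "__main__":
    data_rng = np.random.default_rng(1)
    xs = data_rng.normal(size=(16, 8))
    ys = xs @ data_rng.normal(size=8)
    theta = train(xs, ys, size=8)
    for i in range(3):
        print(i, self_influence(theta, xs, ys, i))
```

In this sketch the cost of estimating influence for every sample is one training run plus two forward passes per sample. The trade-off, under the same assumptions, is that each sample only ever trains a subset of the parameters.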

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 3

Strengths

- This paper focuses on experience replay, that is, on manipulating the sampling distribution, which I think is an under-represented direction in RL research.
- I like the fact that ToD is applied at the level of individual samples; the usage of ToD feels natural and justified for this use case.

Weaknesses

- The paper does not have a theoretical underpinning for its approach, though some readers may find the idea intuitive. That said, personally, I'm not a fan of removing samples from experience buffers;
- because, the way I see it, deleting "negatively influential experiences" seems to be a lenient/hysteresis update commonly seen in optimistic approaches. It may be useful in some cases but should be used with caution, since it may hinder learning and cause bias; a comparison with similar approaches is

Reviewer 02 · Rating 3 · Confidence 3

Strengths

1. A novel approach is presented for estimating the influence of specific experiences in RL by using Turn-over Dropout (ToD).
2. It is theoretically demonstrated that the complement mask $w_i$ for experience $e_i$ indicates an absence of influence from $e_i$.

Weaknesses

1. The metric used to show self-influence appears inappropriate. As noted in L246, $Q_{w_i}$ is in a state where it has not been trained on $e_i$, so $L_{pe,i}(Q_{w_i}) - L_{pe,i}(Q_{m_i})$ is expected to be greater than zero in most cases, regardless of whether $e_i$ is beneficial for learning. Additionally, since $\pi_{m_i} = \arg\max_{\pi} L_{pi,i}(\pi)$, $L_{pi,i}(\pi_{w_i}) - L_{pi,i}(\pi_{m_i})$ is likely always less than zero. Thus, these metrics do not directly indic

Reviewer 03 · Rating 3 · Confidence 3

Strengths

Being able to efficiently estimate the effect a particular data point has on a network used to estimate $\pi$ or $Q$ is a very powerful tool, and to my knowledge this machinery has not been applied to a deep RL setting before. I like that the authors have aimed to evaluate it for a variety of different purposes. However, it seems that only Section G in the appendix and a short paragraph at the end of Section 6 actually attempt to answer the question in the title of the paper.

Weaknesses

# Major

More explanation of the PIToD method in Section 4 would be very helpful for a reader's understanding. "Thus, some readers may suspect that the parameters dropped out by $m_i$ (i.e., the parameters obtained by applying $w_i$) are not influenced by $e_i$." This sentence caused me a lot of confusion when reading the paper. The phrasing seems to suggest that the parameters dropped out are indeed affected, but this doesn't seem to be the case (as the reader would suspect). I am also st

Taxonomy

Topics: Experimental Behavioral Economics Studies

Methods: Dropout