Turning Sand to Gold: Recycling Data to Bridge On-Policy and Off-Policy Learning via Causal Bound
Tal Fiskus, Uri Shaham

TL;DR
This paper introduces a causal bound framework for deep reinforcement learning (DRL) that reuses past value network outputs stored in the experience replay buffer, improving sample efficiency and reward on Atari 2600 and MuJoCo tasks.
Contribution
The paper derives a causal bound on the factual loss in DRL, the analogue of the on-policy loss, using the Neyman-Rubin potential outcomes framework. The bound is computed from past value network outputs stored in the experience replay buffer.
Findings
Up to 383% higher reward ratio in experiments
Reduced experience replay buffer size by up to 96%
Significant improvement in sample efficiency with negligible cost
Abstract
Deep reinforcement learning (DRL) agents excel at solving complex decision-making tasks across various domains. However, they often require a substantial number of training steps and a vast experience replay buffer, leading to significant computational and resource demands. To address these challenges, we introduce a novel theoretical result that brings the Neyman-Rubin potential outcomes framework into DRL. Unlike most methods that focus on bounding the counterfactual loss, we establish a causal bound on the factual loss, which is analogous to the on-policy loss in DRL. This bound is computed by storing past value network outputs in the experience replay buffer, effectively utilizing data that is usually discarded. Extensive experiments across the Atari 2600 and MuJoCo domains on various agents, such as DQN and SAC, achieve up to 383% higher reward ratio, outperforming the same…
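The storage step described in the abstract, keeping the value network's output alongside each transition instead of discarding it, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the class name `ValueRecordingReplayBuffer`, the field `past_values`, and the ring-buffer layout are all assumptions, and the sketch deliberately stops at storage and sampling because this summary does not state how the causal bound combines the stored values with the training loss.

```python
import numpy as np


class ValueRecordingReplayBuffer:
    """Hypothetical ring buffer that also keeps the value output
    computed when each transition was collected."""

    def __init__(self, capacity, obs_dim, seed=0):
        self.capacity = capacity
        self.obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.next_obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.actions = np.zeros(capacity, dtype=np.int64)
        self.rewards = np.zeros(capacity, dtype=np.float32)
        self.dones = np.zeros(capacity, dtype=np.bool_)
        # Assumed extra field: the value estimate at collection time.
        self.past_values = np.zeros(capacity, dtype=np.float32)
        self.size = 0  # number of valid entries
        self.pos = 0   # next write index
        self.rng = np.random.default_rng(seed)

    def add(self, obs, action, reward, next_obs, done, past_value):
        i = self.pos
        self.obs[i] = obs
        self.actions[i] = action
        self.rewards[i] = reward
        self.next_obs[i] = next_obs
        self.dones[i] = done
        self.past_values[i] = past_value
        # Overwrite the oldest entry once the buffer is full.
        self.pos = (self.pos + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        # Uniform sampling with replacement over the valid entries.
        idx = self.rng.integers(0, self.size, size=batch_size)
        return {
            "obs": self.obs[idx],
            "actions": self.actions[idx],
            "rewards": self.rewards[idx],
            "next_obs": self.next_obs[idx],
            "dones": self.dones[idx],
            "past_values": self.past_values[idx],
        }


# Usage with placeholder data: the "value output" here is a stand-in
# number (0.5 * t), not the output of a real network.
buffer = ValueRecordingReplayBuffer(capacity=4, obs_dim=2)
for t in range(6):
    obs = np.full(2, t, dtype=np.float32)
    buffer.add(obs, action=0, reward=1.0, next_obs=obs + 1,
               done=False, past_value=0.5 * t)
batch = buffer.sample(3)
```

Each sampled batch carries `past_values` next to the usual transition fields, so a training step could compare the current network's output with the value recorded at collection time. The extra memory is one float per stored transition.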
Taxonomy
Topics: Auction Theory and Applications
Methods: Experience Replay · Deep Q-Network · Focus
