Survival Instinct in Offline Reinforcement Learning
Anqi Li, Dipendra Misra, Andrey Kolobov, Ching-An Cheng

TL;DR
This paper uncovers a robustness phenomenon in offline reinforcement learning where algorithms can learn safe policies even with incorrect reward signals, due to a 'survival instinct' driven by pessimism and data biases.
Contribution
It introduces the concept of a 'survival instinct' in offline RL, explaining robustness to reward misspecification through theoretical analysis and empirical validation.
Findings
Offline RL can produce safe policies with wrong rewards.
Pessimism and data biases create a 'survival instinct' in agents.
Under suitable conditions on the data, offline RL can learn from any reward within a class.
Abstract
We present a novel observation about the behavior of offline reinforcement learning (RL) algorithms: on many benchmark datasets, offline RL can produce well-performing and safe policies even when trained with "wrong" reward labels, such as those that are zero everywhere or are negatives of the true rewards. This phenomenon cannot be easily explained by offline RL's return maximization objective. Moreover, it gives offline RL a degree of robustness that is uncharacteristic of its online RL counterparts, which are known to be sensitive to reward design. We demonstrate that this surprising robustness property is attributable to an interplay between the notion of pessimism in offline RL algorithms and certain implicit biases in common data collection practices. As we prove in this work, pessimism endows the agent with a "survival instinct", i.e., an incentive to stay within the data support…
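The interplay between pessimism and data support described in the abstract can be illustrated with a small tabular sketch. This is not the authors' algorithm or experimental setup: the three-state dataset, the fixed `penalty` for out-of-support state-action pairs, and the function name `pessimistic_value_iteration` are all assumptions made for illustration. The sketch assigns a pessimistic value to any pair absent from the data and shows that the greedy policy stays inside the data support under an all-zero reward, and even under a reward that mildly favors leaving it.

```python
import numpy as np

def pessimistic_value_iteration(transitions, rewards, n_states, n_actions,
                                gamma=0.9, penalty=10.0, iters=200):
    """Toy pessimistic Q-iteration (illustrative assumption, not the paper's method).

    transitions: dict mapping each in-support (state, action) pair to its
                 deterministic next state, as observed in the offline data.
    Pairs that never appear in the data get the fixed value -penalty.
    """
    q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        # Start from the pessimistic value for every pair ...
        new_q = np.full((n_states, n_actions), -penalty)
        # ... and back up only the pairs covered by the data.
        for (s, a), s_next in transitions.items():
            new_q[s, a] = rewards[s, a] + gamma * q[s_next].max()
        q = new_q
    return q

# Hypothetical dataset: states 0 and 1 form a loop covered by the data,
# while action 1 in state 0 leads to state 2, where no action was observed.
data = {(0, 0): 1, (1, 0): 0, (0, 1): 2}

# "Wrong" reward 1: zero everywhere.
zero_reward = np.zeros((3, 2))
q_zero = pessimistic_value_iteration(data, zero_reward, 3, 2)
policy_zero = q_zero.argmax(axis=1)

# "Wrong" reward 2: pays the agent for stepping toward the unsupported state.
wrong_reward = np.zeros((3, 2))
wrong_reward[0, 1] = 1.0
q_wrong = pessimistic_value_iteration(data, wrong_reward, 3, 2)
policy_wrong = q_wrong.argmax(axis=1)

print("Q with zero reward:\n", q_zero)
print("Greedy action in state 0 (zero reward):", policy_zero[0])
print("Greedy action in state 0 (misleading reward):", policy_wrong[0])
```

In this toy setting the reward carries no useful signal, yet the greedy action in state 0 is the one that keeps the agent in the covered loop, because the alternative leads to a state whose every action is valued pessimistically. The effect depends on the penalty being large relative to the reward scale; with a small enough penalty, the misleading reward would win.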
Taxonomy
Topics: Reinforcement Learning in Robotics · Mobile Crowdsensing and Crowdsourcing
