Addressing Optimism Bias in Sequence Modeling for Reinforcement Learning
Adam Villaflor, Zhe Huang, Swapnil Pande, John Dolan, Jeff Schneider

TL;DR
This paper introduces a method to reduce optimism bias in sequence modeling for offline reinforcement learning by disentangling policy and world models, improving robustness in stochastic environments like autonomous driving.
Contribution
It proposes explicitly separating policy and world models in sequence modeling to enhance safety and robustness in offline RL, especially in stochastic environments.
Findings
Outperforms existing methods on autonomous driving simulation tasks.
Reduces optimism bias in stochastic environments.
Enhances safety and robustness of policies.
Abstract
Impressive results in natural language processing (NLP) based on the Transformer neural network architecture have inspired researchers to explore viewing offline reinforcement learning (RL) as a generic sequence modeling problem. Recent works based on this paradigm have achieved state-of-the-art results in several of the mostly deterministic offline Atari and D4RL benchmarks. However, because these methods jointly model the states and actions as a single sequencing problem, they struggle to disentangle the effects of the policy and world dynamics on the return. Thus, in adversarial or stochastic environments, these methods lead to overly optimistic behavior that can be dangerous in safety-critical systems like autonomous driving. In this work, we propose a method that addresses this optimism bias by explicitly disentangling the policy and world models, which allows us at test time to…
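The optimism bias described in the abstract can be illustrated with a toy one-step problem. The sketch below is not the authors' method or code: the environment, the action names `safe` and `risky`, the payoff values, and all function names are hypothetical and chosen only for illustration. It contrasts picking the action most associated with a high observed return (which mixes up the agent's choice and the environment's luck) with picking the action whose estimated expected return is highest (which accounts for the environment's randomness separately).

```python
import random
from collections import defaultdict


def sample_return(action, rng):
    """Hypothetical stochastic environment: 'safe' always pays 1,
    'risky' pays 10 with probability 0.1 and -10 otherwise."""
    if action == "safe":
        return 1.0
    return 10.0 if rng.random() < 0.1 else -10.0


def collect_dataset(n, seed=0):
    """Offline data gathered by a uniform-random behavior policy."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        action = rng.choice(["safe", "risky"])
        data.append((action, sample_return(action, rng)))
    return data


def return_conditioned_action(data, target_return):
    """Pick the action seen most often among samples that reached the
    target return. This ignores how unlikely the lucky outcome was."""
    counts = defaultdict(int)
    for action, ret in data:
        if ret >= target_return:
            counts[action] += 1
    return max(counts, key=counts.get)


def expected_value_action(data):
    """Pick the action with the highest empirical mean return, which
    averages over the environment's randomness for each action."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for action, ret in data:
        totals[action] += ret
        counts[action] += 1
    return max(totals, key=lambda a: totals[a] / counts[a])


if __name__ == "__main__":
    data = collect_dataset(5000)
    print("return-conditioned choice:", return_conditioned_action(data, 10.0))
    print("expected-value choice:", expected_value_action(data))
```

In this toy setting, conditioning on the best observed return selects the risky action even though its average outcome is much worse, which is the kind of overly optimistic behavior the abstract warns about in stochastic environments.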
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Autonomous Vehicle Technology and Safety · Reinforcement Learning in Robotics
Methods: Attention Is All You Need · Test · Linear Layer · Absolute Position Encodings · Dropout · Multi-Head Attention · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Layer Normalization · Adam
