Uniformly Conservative Exploration in Reinforcement Learning
Wanqiao Xu, Jason Yecheng Ma, Kan Xu, Hamsa Bastani, Osbert Bastani

TL;DR
This paper introduces a conservative exploration method for reinforcement learning that limits harmful exploration in individual episodes: the learner must perform at least as well as a conservative baseline policy, up to a per-episode exploration budget, and it adaptively decides when to explore.
Contribution
It presents a novel algorithm that explores with a UCB reinforcement learning policy but overrides that policy whenever needed to satisfy the exploration constraint with high probability. The approach is analyzed in the tabular setting and extended to continuous state spaces.
Findings
In the tabular setting, the algorithm is proven to remain conservative while minimizing regret.
Experiments on a sepsis treatment task and an HIV treatment task show that the algorithm can learn while ensuring good performance.
The approach extends to deep RL for continuous state spaces.
Abstract
A key challenge to deploying reinforcement learning in practice is avoiding excessive (harmful) exploration in individual episodes. We propose a natural constraint on exploration -- uniformly outperforming a conservative policy (adaptively estimated from all data observed thus far), up to a per-episode exploration budget. We design a novel algorithm that uses a UCB reinforcement learning policy for exploration, but overrides it as needed to satisfy our exploration constraint with high probability. Importantly, to ensure unbiased exploration across the state space, our algorithm adaptively determines when to explore. We prove that our approach remains conservative while minimizing regret in the tabular setting. We experimentally validate our results on a sepsis treatment task and an HIV treatment task, demonstrating that our algorithm can learn while ensuring good performance…
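The override idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a scalar value estimate for the exploratory (UCB) policy, a Hoeffding-style confidence width, and a baseline value that is treated as given, whereas the paper estimates the conservative policy adaptively from data. The names `confidence_width` and `choose_policy` and all parameters are hypothetical.

```python
import math


def confidence_width(visit_count: int, horizon: int, delta: float = 0.05) -> float:
    """Hoeffding-style width for an estimate of a return bounded in [0, horizon].

    Assumption for illustration only; the paper's confidence bounds may differ.
    """
    n = max(visit_count, 1)  # avoid division by zero before any visits
    return horizon * math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def choose_policy(
    explore_value_estimate: float,
    explore_visits: int,
    baseline_value_estimate: float,
    budget: float,
    horizon: int,
    delta: float = 0.05,
) -> str:
    """Run the exploratory policy only if its pessimistic value estimate
    stays within `budget` of the baseline; otherwise fall back to the baseline.
    """
    lower_bound = explore_value_estimate - confidence_width(
        explore_visits, horizon, delta
    )
    if lower_bound >= baseline_value_estimate - budget:
        return "explore"
    return "baseline"
```

With many observations the confidence width shrinks and the exploratory policy is allowed to run; with few observations the pessimistic estimate falls below the threshold and the sketch falls back to the baseline.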
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning
