Offline RL via Feature-Occupancy Gradient Ascent
Gergely Neu, Nneka Okolo

TL;DR
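This paper introduces an offline reinforcement learning algorithm that performs gradient ascent in the space of feature occupancies. In large infinite-horizon discounted MDPs with linearly realizable rewards and transitions, its sample complexity scales optimally with the desired accuracy, and its guarantees hold under the least restrictive data coverage assumptions known in the literature.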
Contribution
Starting from the linear-program formulation of optimal control in MDPs, the paper develops a gradient ascent algorithm in feature-occupancy space whose computational and sample complexity guarantees hold under weak data coverage requirements in linear MDPs.
Findings
Sample complexity scales optimally with the desired accuracy level
Relies on the least restrictive data coverage assumptions known in the literature
Easy to implement without prior coverage knowledge
Abstract
We study offline Reinforcement Learning in large infinite-horizon discounted Markov Decision Processes (MDPs) when the reward and transition models are linearly realizable under a known feature map. Starting from the classic linear-program formulation of the optimal control problem in MDPs, we develop a new algorithm that performs a form of gradient ascent in the space of feature occupancies, defined as the expected feature vectors that can potentially be generated by executing policies in the environment. We show that the resulting simple algorithm satisfies strong computational and sample complexity guarantees, achieved under the least restrictive data coverage assumptions known in the literature. In particular, we show that the sample complexity of our method scales optimally with the desired accuracy level and depends on a weak notion of coverage that only requires the empirical…
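To make the notion of a feature occupancy concrete, the following is a minimal sketch that computes it for a fixed policy in a toy two-state MDP. It is not the paper's algorithm: it only evaluates the definition quoted above (the expected feature vector generated by executing a policy). The function name `feature_occupancy`, the normalization by (1 - gamma), the one-hot feature map (which makes any tabular MDP trivially linear), and all numbers are assumptions chosen for illustration.

```python
def feature_occupancy(P, pi, Phi, nu0, gamma, iters=2000):
    """Return (mu, lam) for policy pi (hypothetical helper, illustration only).

    mu[s][a] is the normalized discounted state-action occupancy and
    lam = sum_{s,a} mu[s][a] * Phi[s][a] is the feature occupancy.

    P[s][a][s2] : transition probabilities
    pi[s][a]    : policy probabilities
    Phi[s][a]   : feature vector (list of length d)
    nu0[s]      : initial-state distribution
    """
    S, A, d = len(P), len(P[0]), len(Phi[0][0])
    # State occupancy is the fixed point of
    #   nu = (1 - gamma) * nu0 + gamma * P_pi^T nu,
    # found here by plain fixed-point iteration (a gamma-contraction).
    nu = list(nu0)
    for _ in range(iters):
        nxt = [(1.0 - gamma) * nu0[s2] for s2 in range(S)]
        for s in range(S):
            for a in range(A):
                w = gamma * nu[s] * pi[s][a]
                for s2 in range(S):
                    nxt[s2] += w * P[s][a][s2]
        nu = nxt
    mu = [[nu[s] * pi[s][a] for a in range(A)] for s in range(S)]
    lam = [
        sum(mu[s][a] * Phi[s][a][i] for s in range(S) for a in range(A))
        for i in range(d)
    ]
    return mu, lam


# Toy problem (all values are made up for illustration).
gamma = 0.9
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.0, 1.0]]]
pi = [[0.5, 0.5], [0.2, 0.8]]
nu0 = [1.0, 0.0]
# One-hot features over (state, action) pairs, so d = 4.
Phi = [[[1.0 if i == 2 * s + a else 0.0 for i in range(4)]
        for a in range(2)] for s in range(2)]
omega = [0.0, 1.0, 0.5, 0.2]  # reward weights: r(s, a) = <Phi[s][a], omega>

mu, lam = feature_occupancy(P, pi, Phi, nu0, gamma)

# With linear rewards, the normalized return depends on the policy only
# through its feature occupancy: <lam, omega> equals <mu, r>.
value_from_lam = sum(l * w for l, w in zip(lam, omega))
value_from_mu = sum(
    mu[s][a] * sum(Phi[s][a][i] * omega[i] for i in range(4))
    for s in range(2) for a in range(2)
)
print(lam, value_from_lam, value_from_mu)
```

The final comparison illustrates why optimizing over feature occupancies is natural when rewards are linear in the features: the objective becomes a linear function of a d-dimensional vector rather than of the full state-action occupancy.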
Taxonomy
Topics: Machine Learning and Data Classification
