Efficient Offline Reinforcement Learning: First Imitate, then Improve
Adam Jelley, Trevor McInroe, Sam Devlin, Amos Storkey

TL;DR
This paper introduces a hybrid offline reinforcement learning method that combines supervised pre-training with off-policy fine-tuning, resulting in faster and more stable training on standard benchmarks.
Contribution
The paper pre-trains an actor with behavior cloning and a critic with a supervised Monte-Carlo value error, then improves the policy with off-policy reinforcement learning, making training more efficient and more stable.
Findings
Significantly reduces training time of off-policy algorithms.
Achieves greater stability during training.
Improves performance on standard benchmarks.
Abstract
Supervised imitation-based approaches are often favored over off-policy reinforcement learning approaches for learning policies offline, since their straightforward optimization objective makes them computationally efficient and stable to train. However, their performance is fundamentally limited by the behavior policy that collected the dataset. Off-policy reinforcement learning provides a promising approach for improving on the behavior policy, but training is often computationally inefficient and unstable due to temporal-difference bootstrapping. In this paper, we propose a best-of-both approach by pre-training with supervised learning before improving performance with off-policy reinforcement learning. Specifically, we demonstrate improved efficiency by pre-training an actor with behavior cloning and a critic with a supervised Monte-Carlo value error. We find that we are able to…
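The two supervised pre-training objectives named in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`discounted_returns`, `bc_loss`, `mc_value_loss`), the scalar actions, the squared-error form of both losses, the use of a state-action critic, and the assumption that each episode runs to termination are all hypothetical choices made for the example.

```python
def discounted_returns(rewards, gamma=0.99):
    """Monte-Carlo return-to-go for one episode: G_t = r_t + gamma * G_{t+1}.

    Assumes the episode is complete, so no bootstrapped value is needed
    at the final step (an assumption of this sketch).
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def mse(predictions, targets):
    """Mean squared error between two equal-length sequences of floats."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def bc_loss(policy, states, actions):
    """Behavior cloning: regress the actor's action onto the dataset action."""
    return mse([policy(s) for s in states], actions)


def mc_value_loss(critic, states, actions, returns):
    """Supervised critic loss: regress Q(s, a) onto the Monte-Carlo return."""
    predictions = [critic(s, a) for s, a in zip(states, actions)]
    return mse(predictions, returns)


if __name__ == "__main__":
    # Toy episode with scalar states and actions (hypothetical data).
    states = [0.0, 1.0, 2.0]
    actions = [0.0, 2.0, 4.0]
    rewards = [1.0, 1.0, 1.0]

    targets = discounted_returns(rewards, gamma=0.5)
    print("returns-to-go:", targets)
    print("BC loss:", bc_loss(lambda s: 2.0 * s, states, actions))
    print("MC value loss:", mc_value_loss(lambda s, a: 0.0, states, actions, targets))
```

Both losses are plain regression problems with fixed targets, which is why this stage avoids the temporal-difference bootstrapping that the abstract identifies as the source of instability. The off-policy fine-tuning stage would start from the actor and critic obtained this way.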
Taxonomy
Topics: Reinforcement Learning in Robotics · Behavioral and Psychological Studies
