SMAC: Score-Matched Actor-Critics for Robust Offline-to-Online Transfer
Nathan Samuel de Lara, Florian Shkurti

TL;DR
SMAC is an offline RL method that learns actor-critics which transfer to online fine-tuning without a drop in performance, achieved by regularizing the Q-function during offline training.
Contribution
SMAC regularizes the Q-function to align offline and online maxima, enabling smooth transition and improved performance in offline-to-online RL transfer.
Findings
SMAC converges to offline maxima connected to better online maxima.
Achieves smooth transfer to Soft Actor-Critic and TD3 in all tested tasks.
Reduces regret by 34-58% in most environments.
Abstract
Modern offline Reinforcement Learning (RL) methods find performant actor-critics; however, fine-tuning these actor-critics online with value-based RL algorithms typically causes immediate drops in performance. We provide evidence consistent with the hypothesis that, in the loss landscape, offline maxima for prior algorithms and online maxima are separated by low-performance valleys that gradient-based fine-tuning traverses. Following this, we present Score Matched Actor-Critic (SMAC), an offline RL method designed to learn actor-critics that transition to online value-based RL algorithms with no drop in performance. SMAC avoids valleys between offline and online maxima by regularizing the Q-function during the offline phase to respect a first-order derivative equality between the score of the policy and the action-gradient of the Q-function. We experimentally demonstrate that SMAC converges…
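The derivative equality described in the abstract can be illustrated with a small sketch. The abstract does not give SMAC's actual loss, so this is not the authors' implementation. It assumes a diagonal Gaussian policy, a temperature `alpha` scaling the score, a squared-error penalty, and finite differences in place of automatic differentiation. All function names here are hypothetical.

```python
def gaussian_score(action, mean, std):
    """Score of a diagonal Gaussian policy: d/da log pi(a|s) = -(a - mean) / std**2."""
    return [-(a - m) / (s * s) for a, m, s in zip(action, mean, std)]


def action_gradient(q_fn, state, action, eps=1e-5):
    """Central finite-difference estimate of grad_a Q(s, a)."""
    grad = []
    for i in range(len(action)):
        up = list(action)
        down = list(action)
        up[i] += eps
        down[i] -= eps
        grad.append((q_fn(state, up) - q_fn(state, down)) / (2 * eps))
    return grad


def score_matching_penalty(q_fn, state, action, mean, std, alpha=1.0):
    """Squared mismatch between grad_a Q(s, a) and alpha * score.

    The temperature alpha and the squared-error form are assumptions,
    not details taken from the paper.
    """
    q_grad = action_gradient(q_fn, state, action)
    score = gaussian_score(action, mean, std)
    return sum((g - alpha * s) ** 2 for g, s in zip(q_grad, score))


# Toy setup: a fixed Gaussian policy and two hand-written critics.
MEAN = [0.5, -0.2]
STD = [1.0, 2.0]


def matched_q(state, action):
    # Equals log pi(a|s) up to a constant, so its action-gradient is the score.
    return -0.5 * sum(((a - m) / s) ** 2 for a, m, s in zip(action, MEAN, STD))


def mismatched_q(state, action):
    # Peaks at a = 0, which disagrees with the policy mean.
    return -0.5 * sum(a * a for a in action)


state = [0.0]        # unused by the toy critics
action = [1.0, 1.0]
matched = score_matching_penalty(matched_q, state, action, MEAN, STD)
mismatched = score_matching_penalty(mismatched_q, state, action, MEAN, STD)
print(matched, mismatched)
```

In this toy setup the penalty is close to zero when the critic's action-gradient agrees with the policy score and clearly positive when it does not. A term of this kind, added to the critic loss during offline training, is one way the regularization described above could be realized.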
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Stochastic Gradient Optimization Techniques
