SPEQ: Offline Stabilization Phases for Efficient Q-Learning in High Update-To-Data Ratio Reinforcement Learning
Carlo Romeo, Girolamo Macaluso, Alessandro Sestini, Andrew D. Bagdanov

TL;DR
SPEQ is a hybrid RL training approach that combines low-UTD online updates with periodic high-UTD offline stabilization phases, cutting gradient updates and training time while matching or exceeding the performance of high-UTD methods.
Contribution
The paper presents SPEQ, a novel RL algorithm that efficiently balances online and offline training phases to improve scalability and reduce computational overhead.
Findings
SPEQ requires 40% to 99% fewer gradient updates than state-of-the-art high-UTD methods.
Training time decreases by 27% to 78% relative to the same methods.
Performance on the MuJoCo continuous control benchmark is maintained or improved.
Abstract
High update-to-data (UTD) ratio algorithms in reinforcement learning (RL) improve sample efficiency but incur high computational costs, limiting real-world scalability. We propose Offline Stabilization Phases for Efficient Q-Learning (SPEQ), an RL algorithm that combines low-UTD online training with periodic offline stabilization phases. During these phases, Q-functions are fine-tuned with high UTD ratios on a fixed replay buffer, reducing redundant updates on suboptimal data. This structured training schedule optimally balances computational and sample efficiency, addressing the limitations of both high and low UTD ratio approaches. We empirically demonstrate that SPEQ requires from 40% to 99% fewer gradient updates and 27% to 78% less training time compared to state-of-the-art high UTD ratio methods while maintaining or surpassing their performance on the MuJoCo continuous control…
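The training schedule described in the abstract can be illustrated with a small counting sketch. It is not the authors' implementation: it performs no learning and only tallies gradient updates for a constant high-UTD run versus a run that stays at a low UTD online and periodically adds a burst of updates on the frozen replay buffer. The `Schedule` class, its field names, and every numeric value below are hypothetical and chosen for easy arithmetic; none are taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Schedule:
    """Hypothetical schedule parameters; none of these values come from the paper."""
    total_env_steps: int   # environment interactions in the whole run
    online_utd: int        # gradient updates per environment step while online
    phase_period: int      # env steps between offline phases (0 means never)
    offline_updates: int   # gradient updates per offline stabilization phase


def count_gradient_updates(s: Schedule) -> int:
    """Walk the schedule step by step and tally gradient updates."""
    updates = 0
    for step in range(1, s.total_env_steps + 1):
        # Online: one new transition is collected, followed by a few updates.
        updates += s.online_utd
        # Offline: data collection pauses and the Q-functions are updated
        # many times on the replay buffer as it currently stands.
        if s.phase_period and step % s.phase_period == 0:
            updates += s.offline_updates
    return updates


def reduction(baseline: int, candidate: int) -> float:
    """Fraction of the baseline's gradient updates that the candidate avoids."""
    return 1.0 - candidate / baseline


# Illustrative settings only (assumed, not reported results).
high_utd = Schedule(total_env_steps=100_000, online_utd=20,
                    phase_period=0, offline_updates=0)
hybrid = Schedule(total_env_steps=100_000, online_utd=1,
                  phase_period=10_000, offline_updates=50_000)

baseline_updates = count_gradient_updates(high_utd)  # 100,000 * 20
hybrid_updates = count_gradient_updates(hybrid)      # 100,000 * 1 + 10 * 50,000
print(f"{reduction(baseline_updates, hybrid_updates):.0%} fewer gradient updates")
```

With these made-up settings the hybrid schedule performs 600,000 updates against 2,000,000 for the constant high-UTD run. The point of the sketch is only that concentrating high-UTD work into periodic phases lets the total update count be set independently of the number of environment steps; the actual savings depend on the phase period and phase length used.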
Taxonomy
Topics: Age of Information Optimization
Methods: Dropout
