Warm-up Free Policy Optimization: Improved Regret in Linear Markov Decision Processes
Asaf Cassel, Aviv Rosenberg

TL;DR
This paper introduces a policy optimization algorithm for linear MDPs that needs no warm-up phase and still achieves rate-optimal regret. It is simpler to implement and improves the dependence on horizon and dimension under both adversarial losses with full-information feedback and stochastic losses with bandit feedback.
Contribution
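It replaces the costly pure exploration warm-up phase of Sherman et al. [2023a] with a simple and efficient contraction mechanism, achieving rate-optimal regret with improved dependence on the horizon and the function approximation dimension.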
Findings
Achieves rate-optimal regret in linear MDPs.
Eliminates the pure exploration warm-up phase, which is hard to implement in practice.
Improves the regret's dependence on the horizon and the function approximation dimension.
Abstract
Policy Optimization (PO) methods are among the most popular Reinforcement Learning (RL) algorithms in practice. Recently, Sherman et al. [2023a] proposed a PO-based algorithm with rate-optimal regret guarantees under the linear Markov Decision Process (MDP) model. However, their algorithm relies on a costly pure exploration warm-up phase that is hard to implement in practice. This paper eliminates this undesired warm-up phase, replacing it with a simple and efficient contraction mechanism. Our PO algorithm achieves rate-optimal regret with improved dependence on the other parameters of the problem (horizon and function approximation dimension) in two fundamental settings: adversarial losses with full-information feedback and stochastic losses with bandit feedback.
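For readers unfamiliar with the setting, the following is a sketch of the standard linear MDP model and the regret notion that "rate-optimal" refers to. The notation is an assumption drawn from the common formulation in the literature, not from this page: $\phi$ is a known $d$-dimensional feature map, $\psi_h$ and $\theta_h^k$ are unknown parameters, $H$ is the horizon, $K$ is the number of episodes, and $\pi^k$ is the policy played in episode $k$.

```latex
% Linear MDP: transitions and losses are linear in a known feature map
% \phi : S \times A \to \mathbb{R}^d (standard definition, assumed notation)
P_h(s' \mid s, a) = \phi(s, a)^\top \psi_h(s'), \qquad
\ell_h^k(s, a) = \phi(s, a)^\top \theta_h^k, \qquad h \in [H].

% Regret over K episodes against the best fixed policy in hindsight,
% where V_1^{k,\pi}(s_1) is the expected total loss of \pi in episode k
\mathrm{Regret}(K) = \sum_{k=1}^{K} V_1^{k, \pi^k}(s_1)
                   - \min_{\pi} \sum_{k=1}^{K} V_1^{k, \pi}(s_1).
```

Under this reading, a rate-optimal guarantee is one in which the regret grows as $\sqrt{K}$ up to logarithmic factors, with the remaining factors polynomial in $d$ and $H$. In the stochastic setting the loss parameters do not change across episodes, so the superscript $k$ on $\theta_h^k$ can be dropped.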
