Sequential Off-Policy Learning with Logarithmic Smoothing
Maxime Haddouche, Otmane Sakhi

TL;DR
This paper introduces a simple algorithm for sequential off-policy learning that combines Logarithmic Smoothing (LS) estimation with online PAC-Bayesian tools, and reports gains over existing methods when policies are repeatedly updated and redeployed.
Contribution
The work presents a simple, theoretically grounded algorithm for sequential off-policy learning that generalizes previous approaches and improves performance when policies are updated repeatedly.
Findings
The proposed algorithms match state-of-the-art offline methods in batch settings.
They outperform traditional methods in sequential policy update scenarios.
Empirical results demonstrate the benefits of the sequential framework and the proposed algorithms.
Abstract
Off-policy learning enables training policies from logged interaction data. Most prior work considers the batch setting, where a policy is learned from data generated by a single behavior policy. In real systems, however, policies are updated and redeployed repeatedly, each time training on all previously collected data while generating new interactions for future updates. This sequential off-policy learning setting is common in practice but remains largely unexplored theoretically. In this work, we present and study a simple algorithm for sequential off-policy learning, combining Logarithmic Smoothing (LS) estimation with online PAC-Bayesian tools. We further show that a principled adjustment to LS improves performance and accelerates convergence under mild conditions. The algorithms introduced generalise previous work: they match state-of-the-art offline approaches in the batch case…
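The abstract names Logarithmic Smoothing (LS) estimation but does not give its formula. As a rough illustration only, the sketch below assumes the form usually given for the LS estimator in earlier offline work: costs c_i in [-1, 0], importance weights w_i >= 0, a smoothing parameter lam > 0, and the estimate -(1/n) * sum_i (1/lam) * log(1 - lam * w_i * c_i). The function names, the toy data, and the choice of lam are hypothetical. This is not the paper's sequential algorithm or its adjusted variant.

```python
import math


def ips_estimate(weights, costs):
    """Plain inverse propensity scoring: the mean of w_i * c_i."""
    return sum(w * c for w, c in zip(weights, costs)) / len(weights)


def ls_estimate(weights, costs, lam):
    """Assumed LS form: -(1/n) * sum_i log(1 - lam * w_i * c_i) / lam.

    Assumes costs in [-1, 0] and weights >= 0, so 1 - lam * w * c >= 1
    and the logarithm is always defined.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    total = 0.0
    for w, c in zip(weights, costs):
        # log1p(x) computes log(1 + x) accurately for small x.
        total += math.log1p(-lam * w * c) / lam
    return -total / len(weights)


# Hypothetical logged data: importance weights and costs in [-1, 0].
weights = [0.5, 2.0, 10.0]
costs = [-1.0, 0.0, -0.5]

ips = ips_estimate(weights, costs)
ls = ls_estimate(weights, costs, lam=1.0)
ls_small_lam = ls_estimate(weights, costs, lam=1e-8)
```

Under these assumptions, the logarithm damps large importance weights, so the LS value lies between the IPS value and zero, and it approaches the IPS value as lam shrinks toward zero.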
