Provably More Efficient Q-Learning in the One-Sided-Feedback/Full-Feedback Settings

Xiao-Yue Gong; David Simchi-Levi

arXiv:2007.00080 · cs.LG · October 6, 2020

TL;DR

This paper introduces new Q-learning algorithms, HQL and FQL, with improved regret bounds for one-sided and full feedback settings, demonstrating better efficiency and scalability in inventory control problems.

Contribution

The paper proposes two algorithms, HQL for the one-sided-feedback setting and FQL for the full-feedback setting, whose regret bounds do not depend on the size of the state and action spaces.

Findings

01

HQL achieves $\tilde{O}(H^3\,\sqrt{T})$ regret.

02

FQL achieves $\tilde{O}(H^2\,\sqrt{T})$ regret.

03

Numerical experiments confirm superior efficiency of the proposed algorithms.

Abstract

Motivated by the episodic version of the classical inventory control problem, we propose a new Q-learning-based algorithm, Elimination-Based Half-Q-Learning (HQL), that enjoys improved efficiency over existing algorithms for a wide variety of problems in the one-sided-feedback setting. We also provide a simpler variant of the algorithm, Full-Q-Learning (FQL), for the full-feedback setting. We establish that HQL incurs $\tilde{O}(H^{3}\sqrt{T})$ regret and FQL incurs $\tilde{O}(H^{2}\sqrt{T})$ regret, where $H$ is the length of each episode and $T$ is the total length of the horizon. The regret bounds are not affected by the possibly huge state and action spaces. Our numerical experiments demonstrate the superior efficiency of HQL and FQL, and the potential to combine reinforcement learning with richer feedback models.
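
The abstract does not spell out what one-sided feedback looks like in inventory control. As a minimal sketch, assuming a single-period lost-sales model with illustrative `price` and `holding_cost` parameters (hypothetical names and values, not the paper's formulation), the following shows how one censored sales observation can determine the reward of every lower order-up-to level. It illustrates the feedback structure only; it is not HQL itself.

```python
import random


def lost_sales_reward(order_up_to, demand, price=1.0, holding_cost=0.2):
    """Single-period reward: revenue on units sold minus holding cost on leftovers."""
    sales = min(order_up_to, demand)
    leftover = order_up_to - sales
    return price * sales - holding_cost * leftover


def one_sided_feedback(chosen_level, observed_sales, price=1.0, holding_cost=0.2):
    """Rewards for every level <= chosen_level, using censored sales only.

    For a lower level y, sales would have been min(y, demand). Because
    observed_sales = min(chosen_level, demand) and y <= chosen_level,
    min(y, demand) equals min(y, observed_sales), so the true demand
    is never needed.
    """
    feedback = {}
    for y in range(chosen_level + 1):
        sales = min(y, observed_sales)
        feedback[y] = price * sales - holding_cost * (y - sales)
    return feedback


rng = random.Random(0)
demand = rng.randint(0, 10)      # hidden from the learner
chosen = 6                       # order-up-to level actually played
observed = min(chosen, demand)   # censored observation
feedback = one_sided_feedback(chosen, observed)
for level, reward in feedback.items():
    print(level, round(reward, 2))
```

Under these assumptions, levels above `chosen` stay unobserved whenever demand is censored, which is what makes the feedback one-sided rather than full.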

Taxonomy

Topics: Advanced Bandit Algorithms Research · Smart Grid Energy Management · Age of Information Optimization

Methods: Q-Learning