Provably More Efficient Q-Learning in the One-Sided-Feedback/Full-Feedback Settings
Xiao-Yue Gong, David Simchi-Levi

TL;DR
This paper introduces new Q-learning algorithms, HQL and FQL, with improved regret bounds for one-sided and full feedback settings, demonstrating better efficiency and scalability in inventory control problems.
Contribution
The paper proposes HQL and FQL algorithms with regret bounds independent of state and action space size, advancing reinforcement learning in feedback-rich environments.
Findings
HQL achieves $\tilde{O}(H^3\sqrt{T})$ regret.
FQL achieves $\tilde{O}(H^2\sqrt{T})$ regret.
Numerical experiments confirm superior efficiency of the proposed algorithms.
Abstract
Motivated by the episodic version of the classical inventory control problem, we propose a new Q-learning-based algorithm, Elimination-Based Half-Q-Learning (HQL), that enjoys improved efficiency over existing algorithms for a wide variety of problems in the one-sided-feedback setting. We also provide a simpler variant of the algorithm, Full-Q-Learning (FQL), for the full-feedback setting. We establish that HQL incurs $\tilde{O}(H^3\sqrt{T})$ regret and FQL incurs $\tilde{O}(H^2\sqrt{T})$ regret, where $H$ is the length of each episode and $T$ is the total length of the horizon. The regret bounds are not affected by the possibly huge state and action space. Our numerical experiments demonstrate the superior efficiency of HQL and FQL, and the potential to combine reinforcement learning with richer feedback models.
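To make the one-sided-feedback idea concrete, here is a minimal Python sketch for a lost-sales inventory step. It is not the paper's HQL algorithm. The function names, the linear reward model (unit price minus per-unit holding cost), and the integer stock levels are all assumptions chosen for illustration. The sketch only shows why a single censored sales observation at stock level `level` is enough to evaluate every lower stock level, while saying nothing about higher ones.

```python
def censored_sales(demand: int, level: int) -> int:
    """Sales observed when stocking `level` units: demand is censored at the stock level."""
    return min(demand, level)


def one_sided_feedback(
    observed_sales: int, level: int, price: float, holding_cost: float
) -> dict[int, float]:
    """Infer the reward of every stock level y <= level from one censored observation.

    Hypothetical reward model (an assumption, not taken from the paper):
    reward(y) = price * sales(y) - holding_cost * leftover(y).
    """
    rewards = {}
    for y in range(level + 1):
        # Because y <= level, min(observed_sales, y) equals min(true demand, y).
        sales_y = min(observed_sales, y)
        leftover_y = y - sales_y
        rewards[y] = price * sales_y - holding_cost * leftover_y
    return rewards


if __name__ == "__main__":
    # True demand is 3, but the learner only sees sales at the chosen level 5.
    sales = censored_sales(demand=3, level=5)
    print(one_sided_feedback(sales, level=5, price=2.0, holding_cost=1.0))
```

Levels above the chosen one stay unobserved: if sales equal the stock level, the learner cannot tell how much demand was lost. That asymmetry is what separates this setting from the full-feedback setting.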
Taxonomy
Topics: Advanced Bandit Algorithms Research · Smart Grid Energy Management · Age of Information Optimization
Methods: Q-Learning
