Average Reward Adjusted Discounted Reinforcement Learning: Near-Blackwell-Optimal Policies for Real-World Applications
Manuel Schneckenreither

TL;DR
This paper introduces a novel reinforcement learning algorithm that assesses the average reward separately and yields near-Blackwell-optimal policies for real-world operations research problems, especially those in which the agent receives non-zero rewards (such as costs or profit) at every step.
Contribution
It gives a theoretical account of why standard discounted RL is inappropriate when rewards are permanently non-zero, and it establishes a new near-Blackwell-optimal algorithm for inferring policies in such settings.
Findings
The new algorithm infers optimal policies on all tested problems.
It addresses limitations of standard discounted RL with non-zero rewards.
It is shown to be effective on M/M/1 queuing systems.
Abstract
Although reinforcement learning has become very popular in recent years, the number of successful applications to different kinds of operations research problems is rather scarce. Reinforcement learning is based on the well-studied dynamic programming technique and thus also aims at finding the best stationary policy for a given Markov Decision Process, but in contrast does not require any model knowledge. The policy is assessed solely on consecutive states (or state-action pairs), which are observed while an agent explores the solution space. The contributions of this paper are manifold. First, we provide deep theoretical insights into the widely applied standard discounted reinforcement learning framework, which explain why these algorithms are inappropriate when permanently provided with non-zero rewards, such as costs or profit. Second, we establish a novel…
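One way to see the difficulty with permanently non-zero rewards is the standard Laurent series view of discounted values: for a fixed policy on an ergodic chain, the discounted value is approximately the average reward divided by (1 - gamma), plus a state-dependent bias term. The sketch below checks this numerically. It is an illustration only, not the paper's algorithm; the two-state chain, its rewards, and all function names are hypothetical choices made for this example.

```python
# Hypothetical two-state Markov chain induced by some fixed policy
# (illustrative numbers, not taken from the paper).
P = [[0.9, 0.1],
     [0.2, 0.8]]
R = [10.0, 12.0]  # a non-zero reward (e.g. a profit) is received in every state


def expected_next(P, v):
    """Return the vector P v: the expected value of v after one transition."""
    return [sum(p * x for p, x in zip(row, v)) for row in P]


def discounted_values(P, r, gamma, sweeps=50_000):
    """Iterative policy evaluation: repeat V <- r + gamma * P V."""
    v = [0.0] * len(r)
    for _ in range(sweeps):
        pv = expected_next(P, v)
        v = [ri + gamma * x for ri, x in zip(r, pv)]
    return v


def average_reward_and_bias(P, r, steps=500):
    """Average reward rho and bias h(s) = sum_t E[r_t - rho | s_0 = s].

    Assumes an ergodic (aperiodic, single-class) chain, so that the expected
    reward after many steps converges to rho from every start state.
    """
    x = list(r)
    for _ in range(steps):
        x = expected_next(P, x)
    rho = x[0]
    h = [0.0] * len(r)
    d = [ri - rho for ri in r]  # expected deviation from rho at step t
    for _ in range(steps):
        h = [hi + di for hi, di in zip(h, d)]
        d = expected_next(P, d)
    return rho, h


rho, bias = average_reward_and_bias(P, R)
for gamma in (0.9, 0.99, 0.999):
    v = discounted_values(P, R, gamma)
    shared = rho / (1.0 - gamma)  # term common to all states
    print(gamma, shared, [vi - shared for vi in v], bias)
```

As gamma approaches 1, the shared term rho / (1 - gamma) grows without bound, while the difference between the states stays close to the difference of their bias values. Any method that learns the full discounted value therefore has to resolve small policy-relevant differences on top of a very large common offset, which is one motivation for estimating the average reward separately.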
Taxonomy
Topics: Reinforcement Learning in Robotics · Supply Chain and Inventory Management · Scheduling and Optimization Algorithms
