On-line Policy Iteration with Policy Switching for Markov Decision Processes
Hyeong Soo Chang

TL;DR
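This paper introduces an on-line policy iteration algorithm with policy switching for infinite-horizon discounted Markov decision processes. Each update switches the action only at the current state, and the resulting policy sequence converges in finite time to an optimal policy for an associated local MDP, or for the original MDP when it is communicating.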
Contribution
The paper develops an off-line policy iteration method integrated with policy switching, a multi-policy improvement method, and adapts its asynchronous variant into an on-line algorithm for MDPs.
Findings
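The policy sequence converges in finite time to an optimal policy for an associated local MDP, which depends on the MDP's state-transition structure.
When the MDP is communicating, the sequence converges to an optimal policy for the original MDP.
Each update switches the action only at the current state and preserves the monotonicity of the policies' value functions.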
Abstract
Motivated by Bertsekas' recent study on policy iteration (PI) for solving infinite-horizon discounted Markov decision processes (MDPs) in an on-line setting, we develop an off-line PI integrated with a multi-policy improvement method of policy switching and then adapt its asynchronous variant into an on-line PI algorithm that generates a sequence of policies over time. The current policy is updated into the next policy by switching the action only at the current state while ensuring the monotonicity of the value functions of the policies in the sequence. Depending on the MDP's state-transition structure, the sequence converges in finite time to an optimal policy for an associated local MDP. When the MDP is communicating, the sequence converges to an optimal policy for the original MDP.
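The single-state update described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: it replaces the multi-policy switching rule with a plain greedy switch at the current state, which is a standard step that also keeps the value functions monotone. The array layout (`P` with shape actions x states x states, `R` with shape states x actions) and the function names `evaluate`, `online_step`, and `run` are assumptions made for illustration.

```python
import numpy as np

def evaluate(P, R, policy, gamma):
    """Exact value of a deterministic policy: solve (I - gamma * P_pi) v = r_pi."""
    n = P.shape[1]
    idx = np.arange(n)
    P_pi = P[policy, idx, :]   # row s is the transition row of action policy[s] at s
    r_pi = R[idx, policy]      # reward of the policy's action at each state
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def online_step(P, R, policy, x, gamma, tol=1e-12):
    """Switch the action only at the current state x (greedy stand-in, an assumption)."""
    v = evaluate(P, R, policy, gamma)
    q_x = R[x] + gamma * P[:, x, :] @ v   # one-step lookahead value of each action at x
    best = int(np.argmax(q_x))
    new_policy = policy.copy()
    # Switch only on strict improvement, so the new value function dominates the old one.
    if q_x[best] > v[x] + tol:
        new_policy[x] = best
    return new_policy

def run(P, R, gamma, x0, steps, seed=0):
    """Follow the evolving policy along one trajectory, updating at each visited state."""
    rng = np.random.default_rng(seed)
    n = P.shape[1]
    policy = np.zeros(n, dtype=int)       # arbitrary initial policy
    x = x0
    values = [evaluate(P, R, policy, gamma)]
    for _ in range(steps):
        policy = online_step(P, R, policy, x, gamma)
        values.append(evaluate(P, R, policy, gamma))
        x = int(rng.choice(n, p=P[policy[x], x, :]))  # sample the next state
    return policy, values
```

In this sketch, states that the trajectory never visits keep their initial action, so any optimality it reaches is relative to the visited part of the state space only. Exact policy evaluation by a linear solve is used for clarity and would be replaced by something cheaper in a large MDP.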
Taxonomy
Topics: Reinforcement Learning in Robotics · Fuel Cells and Related Materials
