An Optimal Policy for Learning Controllable Dynamics by Exploration
Peter N. Loxley

TL;DR
This paper derives an optimal, computationally efficient exploration policy for learning controllable Markov chain dynamics in an unknown environment over a limited time horizon, and shows that the policy must be non-stationary because transient, absorbing, and non-backtracking states restrict control of the dynamics.
Contribution
It introduces a simple, optimal exploration policy for controllable dynamics, accounting for transient, absorbing, and non-backtracking states, with a practical algorithm for implementation.
Findings
The policy maximizes information gain during exploration.
Non-stationary policies are necessary for optimal exploration.
The policy's effectiveness is demonstrated through examples and comparisons with dynamic programming.
Abstract
Controllable Markov chains describe the dynamics of sequential decision-making tasks and are the central component in optimal control and reinforcement learning. In this work, we give the general form of an optimal policy for learning controllable dynamics in an unknown environment by exploring over a limited time horizon. This policy is simple to implement and efficient to compute, and allows an agent to "learn by exploring" as it maximizes its information gain in a greedy fashion by selecting controls from a constraint set that changes over time during exploration. We give a simple parameterization for the set of controls, and present an algorithm for finding an optimal policy. The policy takes this form because certain types of states restrict control of the dynamics, such as transient states, absorbing states, and non-backtracking states. We show why the…
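The idea in the abstract of greedy control selection from a time-varying constraint set can be illustrated with a minimal sketch. This is not the paper's algorithm: the function `greedy_explore`, the toy three-state chain, the `allowed` constraint set, and the use of visit counts as a stand-in for information gain are all assumptions made here for illustration.

```python
import random
from collections import defaultdict


def greedy_explore(transition, allowed, start, horizon, seed=0):
    """Greedy exploration sketch (hypothetical helper, not the paper's method).

    At each step the agent picks, from the controls allowed at time t, the one
    it has tried least often in the current state. The visit count is only a
    crude proxy for information gain.
    """
    rng = random.Random(seed)
    visits = defaultdict(int)  # (state, control) -> number of times tried
    state = start
    trajectory = []
    for t in range(horizon):
        candidates = allowed(state, t)
        # Least-tried control first; ties broken by the smaller control index.
        control = min(candidates, key=lambda u: (visits[(state, u)], u))
        next_state = transition(state, control, rng)
        visits[(state, control)] += 1
        trajectory.append((state, control, next_state))
        state = next_state
    return trajectory, visits


def toy_transition(state, control, rng):
    """Assumed toy chain: states 0, 1, 2, where state 2 is absorbing."""
    if state == 2 or control == 0:
        return state  # control 0 stays put; state 2 can never be left
    # Control 1 moves one state forward with probability 0.8.
    return min(state + 1, 2) if rng.random() < 0.8 else state


def toy_allowed(state, t):
    """Assumed constraint set: control 1 is withheld for the first 5 steps."""
    return [0] if t < 5 else [0, 1]


trajectory, visits = greedy_explore(toy_transition, toy_allowed, start=0, horizon=10)
for step in trajectory:
    print(step)
```

Because the allowed controls depend on the time step as well as the state, the resulting policy is non-stationary: here the agent is kept away from the absorbing state early on, since any time spent there yields no further observations of the other states.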
Taxonomy
Topics: Reinforcement Learning in Robotics · Robot Manipulation and Learning · Advanced Bandit Algorithms Research
