Occupancy Information Ratio: Infinite-Horizon, Information-Directed, Parameterized Policy Search

Wesley A. Suttle; Alec Koppel; Ji Liu

arXiv:2201.08832·cs.LG·December 29, 2023

TL;DR

This paper introduces the occupancy information ratio (OIR), a novel objective for infinite-horizon reinforcement learning that balances policy cost and state occupancy entropy, enabling scalable model-free policy search with proven convergence properties.

Contribution

The work formulates the OIR as a new RL objective, develops a theoretical foundation with a policy gradient theorem, and proposes algorithms with convergence guarantees, demonstrating practical benefits in sparse-reward environments.

Findings

01 OIR-based methods outperform vanilla RL methods in sparse-reward environments.

02 REINFORCE-style algorithms have finite-time convergence guarantees to global optimality.

03 Actor-critic methods converge asymptotically to near-global optima.

Abstract

In this work, we propose an information-directed objective for infinite-horizon reinforcement learning (RL), called the occupancy information ratio (OIR), inspired by the information ratio objectives used in previous information-directed sampling schemes for multi-armed bandits and Markov decision processes as well as recent advances in general utility RL. The OIR, comprised of a ratio between the average cost of a policy and the entropy of its induced state occupancy measure, enjoys rich underlying structure and presents an objective to which scalable, model-free policy search methods naturally apply. Specifically, we show by leveraging connections between quasiconcave optimization and the linear programming theory for Markov decision processes that the OIR problem can be transformed and solved via concave programming methods when the underlying model is known. Since model knowledge is…
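To make the ratio described in the abstract concrete, the following is a minimal sketch that evaluates it for a fixed policy on a small tabular MDP with a known model. It assumes the average-cost setting in which the state occupancy measure is the stationary distribution of the policy-induced Markov chain, that the chain is ergodic, and that costs are positive; these are assumptions of the sketch, not statements taken from the paper. The function names, the two-state MDP, and the cost values are hypothetical and chosen only for illustration.

```python
import numpy as np

def stationary_distribution(P_pi):
    """Stationary distribution d of a row-stochastic matrix P_pi (assumed ergodic)."""
    n = P_pi.shape[0]
    # Stack the balance equations (P_pi^T - I) d = 0 with the constraint sum(d) = 1
    # and solve the resulting overdetermined system by least squares.
    A = np.vstack([P_pi.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d

def occupancy_information_ratio(P, c, pi):
    """Average cost divided by state-occupancy entropy for a fixed policy.

    P:  (S, A, S) transition probabilities (hypothetical known model)
    c:  (S, A) per-step costs, assumed positive
    pi: (S, A) policy, each row a distribution over actions
    """
    # Transition matrix of the Markov chain induced by pi.
    P_pi = np.einsum("sa,sat->st", pi, P)
    d = stationary_distribution(P_pi)
    # Long-run average cost under the stationary state-action distribution.
    avg_cost = float(np.sum(d[:, None] * pi * c))
    # Shannon entropy (in nats) of the state occupancy; skip zero-mass states.
    nz = d > 1e-12
    entropy = float(-np.sum(d[nz] * np.log(d[nz])))
    return avg_cost / entropy, avg_cost, entropy, d

# Hypothetical two-state, two-action MDP: action 0 stays, action 1 switches state.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0
P[0, 1, 1] = P[1, 1, 0] = 1.0
c = np.array([[1.0, 2.0], [1.0, 2.0]])   # illustrative costs
pi = np.full((2, 2), 0.5)                # uniform random policy

ratio, avg_cost, entropy, d = occupancy_information_ratio(P, c, pi)
print(ratio, avg_cost, entropy, d)
```

A lower ratio corresponds to a policy that incurs less cost per unit of state-occupancy entropy, which is the trade-off the objective is described as capturing. The model-free policy search methods mentioned in the abstract would not have access to `P`; this sketch only illustrates the quantity being optimized.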

Taxonomy

Topics: Smart Grid Energy Management · Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics