Scalable Bilinear $\pi$ Learning Using State and Action Features

Yichen Chen; Lihong Li; Mengdi Wang

arXiv:1804.10328·cs.LG·April 30, 2018·22 cites

TL;DR

This paper introduces a scalable, model-free bilinear $\pi$ learning algorithm for large MDPs that uses state and action features, runs fully online with minimal memory, and is provably sample-efficient.

Contribution

It develops a novel bilinear $\pi$ learning algorithm that represents the value function and the state-action distribution with feature-based (bi)linear models, enabling scalable, online, and sample-efficient reinforcement learning in large MDPs.
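For reference, the standard linear program for a discounted MDP, and the Lagrangian behind a primal-dual formulation, can be written as follows. This is a sketch in our own notation, not copied from the paper: $q$ is an initial-state distribution, $P_a$ and $r_a$ are the transition matrix and reward vector of action $a$, $\gamma \in (0,1)$ is the discount, $P$ and $r$ stack $P_a$ and $r_a$ over state-action pairs, and $E \in \mathbb{R}^{|S||A| \times |S|}$ has $E_{(s,a),s'} = \mathbf{1}\{s = s'\}$. Substituting linear models $v = \Phi w$ (state features $\Phi \in \mathbb{R}^{|S| \times m}$) and $\mu = \Psi \lambda$ (state-action features $\Psi \in \mathbb{R}^{|S||A| \times n}$) makes the coupling term bilinear in $(w, \lambda)$. The paper's exact constraint sets and normalizations may differ.

$$
\min_{v}\ (1-\gamma)\, q^\top v \quad \text{s.t.} \quad v \ge r_a + \gamma P_a v \quad \forall a \in \mathcal{A}
$$

$$
\min_{v}\ \max_{\mu \ge 0}\ L(v,\mu) = (1-\gamma)\, q^\top v + \sum_{a} \mu_a^\top \big( r_a + \gamma P_a v - v \big)
$$

$$
L(w,\lambda) = (1-\gamma)\, q^\top \Phi w + \lambda^\top \Psi^\top r + \lambda^\top \Psi^\top (\gamma P - E)\, \Phi w
$$

With this normalization, the optimal dual variable $\mu$ is the discounted state-action occupancy measure of an optimal policy, and a policy can be read off as $\pi(a \mid s) \propto \mu(s,a)$.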

Findings

01. The algorithm's run-time complexity depends on the number of features, not on the size of the MDP.

02. It operates fully online without storing samples, so its memory footprint is minimal.

03. It is provably sample-efficient, with sample complexity linear in the parameter dimension.

Abstract

Approximate linear programming (ALP) represents one of the major algorithmic families to solve large-scale Markov decision processes (MDPs). In this work, we study a primal-dual formulation of the ALP, and develop a scalable, model-free algorithm called bilinear $\pi$ learning for reinforcement learning when a sampling oracle is provided. This algorithm enjoys a number of advantages. First, it adopts (bi)linear models to represent the high-dimensional value function and state-action distributions, using given state and action features. Its run-time complexity depends on the number of features, not the size of the underlying MDPs. Second, it operates in a fully online fashion without having to store any sample, thus having minimal memory footprint. Third, we prove that it is sample-efficient, solving for the optimal policy to high precision with a sample complexity linear in the dimension…
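The online, feature-based update described in the abstract can be illustrated with a minimal stochastic primal-dual sketch on the bilinear Lagrangian $L(w,\lambda) = (1-\gamma) q^\top \Phi w + \lambda^\top \Psi^\top (r + (\gamma P - E)\Phi w)$. Several assumptions are ours, not the paper's: each oracle call returns the features of an initial state $s_0 \sim q$, a sampled pair $(s,a)$ together with its sampling probability $\xi(s,a)$, the next state $s'$, and the reward; and plain gradient descent-ascent with clipping of $\lambda$ at zero (which keeps $\mu = \Psi\lambda \ge 0$ only when $\Psi$ is nonnegative) stands in for whatever mirror-descent steps and constraint sets the paper actually uses. The names `stochastic_gradients`, `primal_dual_step`, `run_online`, and `sampler` are hypothetical. Each step touches only $O(m+n)$ numbers and discards its sample, which is what makes the cost independent of $|S|$ and the memory footprint small.

```python
import random


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def stochastic_gradients(w, lam, phi_s0, phi_s, psi_sa, phi_next, reward, gamma, xi_sa):
    """One-sample unbiased gradient estimates of the bilinear Lagrangian
    L(w, lam) = (1-gamma) q^T Phi w + lam^T Psi^T (r + (gamma P - E) Phi w),
    assuming s0 ~ q, (s, a) ~ xi, and s' ~ P(. | s, a)."""
    # Sampled Bellman residual r(s,a) + gamma * v(s') - v(s) with v = Phi w.
    td = reward + gamma * dot(phi_next, w) - dot(phi_s, w)
    # Importance weight mu(s,a) / xi(s,a) with mu = Psi lam.
    weight = dot(psi_sa, lam) / xi_sa
    # Estimate of (1-gamma) Phi^T q + Phi^T (gamma P - E)^T Psi lam.
    grad_w = [(1 - gamma) * p0 + weight * (gamma * pn - ps)
              for p0, pn, ps in zip(phi_s0, phi_next, phi_s)]
    # Estimate of Psi^T (r + (gamma P - E) Phi w).
    grad_lam = [p * td / xi_sa for p in psi_sa]
    return grad_w, grad_lam


def primal_dual_step(w, lam, grad_w, grad_lam, step):
    """Descent in the primal weights w, ascent in the dual weights lam."""
    new_w = [x - step * g for x, g in zip(w, grad_w)]
    # Crude projection onto lam >= 0 (assumption: nonnegative features Psi).
    new_lam = [max(0.0, x + step * g) for x, g in zip(lam, grad_lam)]
    return new_w, new_lam


def run_online(sampler, m, n, gamma, step, num_iters, seed=0):
    """Process one oracle sample per iteration; no sample is ever stored."""
    rng = random.Random(seed)
    w, lam = [0.0] * m, [0.0] * n
    for _ in range(num_iters):
        phi_s0, phi_s, psi_sa, phi_next, reward, xi_sa = sampler(rng)
        grad_w, grad_lam = stochastic_gradients(
            w, lam, phi_s0, phi_s, psi_sa, phi_next, reward, gamma, xi_sa)
        w, lam = primal_dual_step(w, lam, grad_w, grad_lam, step)
        # The sample goes out of scope here, so memory stays O(m + n).
    return w, lam
```

No convergence claim is made for this simplified loop: plain gradient descent-ascent on a bilinear objective can oscillate, which is one reason analyses of stochastic primal-dual methods typically rely on averaged iterates and carefully chosen step sizes.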

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Optimization and Search Problems