Exploration versus exploitation in reinforcement learning: a stochastic control approach

Haoran Wang; Thaleia Zariphopoulou; Xunyu Zhou

arXiv:1812.01552·math.OC·February 14, 2019·20 cites

Open Access

TL;DR

This paper models the exploration-exploitation trade-off in continuous-time reinforcement learning as an entropy-regularized stochastic control problem and shows that, in the linear-quadratic setting, the optimal exploratory control distribution is Gaussian.

Contribution

It introduces an entropy-regularized formulation for exploration in continuous-time RL and characterizes the optimal Gaussian control distribution in the linear-quadratic setting.
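To make the entropy term concrete, here is a minimal Python sketch, not taken from the paper, of the closed-form differential entropy of a one-dimensional Gaussian, compared with a uniform distribution of the same variance. All function names and the simple "reward plus weighted entropy" combination are assumptions made for illustration. The comparison only reflects the standard fact that, for a fixed variance, the Gaussian has the largest differential entropy; it is not the paper's derivation.

```python
import math


def gaussian_entropy(sigma: float) -> float:
    """Differential entropy (in nats) of N(mu, sigma^2); independent of mu."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)


def uniform_entropy_same_variance(sigma: float) -> float:
    """Differential entropy of a uniform law on [-a, a] with variance sigma^2."""
    half_width = math.sqrt(3.0) * sigma  # variance of U[-a, a] is a^2 / 3
    return math.log(2.0 * half_width)


def entropy_regularized_reward(reward: float, sigma: float, lam: float) -> float:
    """Hypothetical running objective: reward plus lam times the policy entropy."""
    return reward + lam * gaussian_entropy(sigma)


if __name__ == "__main__":
    for sigma in (0.5, 1.0, 2.0):
        print(sigma, gaussian_entropy(sigma), uniform_entropy_same_variance(sigma))
```

A larger weight `lam` rewards wider (more exploratory) action distributions, while `lam = 0` recovers the unregularized reward.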

Findings

01

In the linear-quadratic setting, the optimal exploratory control distribution is Gaussian; its mean reflects exploitation and its variance reflects exploration.

02

A more random environment offers more learning opportunities, in the sense that less exploration is needed.

03

In the linear-quadratic case, the cost of exploration is proportional to the entropy regularization weight and inversely proportional to the discount rate.
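The first and third findings can be illustrated with a short Python sketch under loudly stated assumptions. The linear feedback gain, the variance value, and the idea of a constant running exploration penalty `c * lam` are all hypothetical choices, not formulas from the paper. The sketch shows (a) sampling actions from a Gaussian policy whose mean carries the exploitation part and whose variance carries the exploration part, and (b) why a constant penalty proportional to the weight `lam`, discounted at rate `rho`, integrates to something that scales like `lam / rho`.

```python
import math
import random


def sample_gaussian_action(state, gain, variance, rng):
    """Draw one action from N(gain * state, variance).

    The linear mean `gain * state` is an assumed exploitation rule;
    `variance` controls how widely actions are spread (exploration).
    """
    return rng.gauss(gain * state, math.sqrt(variance))


def discounted_constant_cost(rate, rho, horizon=200.0, steps=200_000):
    """Trapezoidal approximation of the integral of exp(-rho * t) * rate on [0, horizon]."""
    dt = horizon / steps
    total = 0.0
    for i in range(steps + 1):
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * math.exp(-rho * i * dt) * rate
    return total * dt


if __name__ == "__main__":
    rng = random.Random(0)
    actions = [sample_gaussian_action(2.0, -0.5, 0.25, rng) for _ in range(5)]
    print(actions)

    c = 0.5  # hypothetical proportionality constant
    for lam, rho in ((1.0, 1.0), (2.0, 1.0), (1.0, 2.0)):
        print(lam, rho, discounted_constant_cost(c * lam, rho), c * lam / rho)
```

Doubling `lam` doubles the discounted total and doubling `rho` halves it, which matches the direction of the stated proportionality without claiming the paper's exact constant.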

Abstract

We consider reinforcement learning (RL) in continuous time and study the problem of achieving the best trade-off between exploration of a black box environment and exploitation of current knowledge. We propose an entropy-regularized reward function involving the differential entropy of the distributions of actions, and motivate and devise an exploratory formulation for the feature dynamics that captures repetitive learning under exploration. The resulting optimization problem is a revitalization of the classical relaxed stochastic control. We carry out a complete analysis of the problem in the linear-quadratic (LQ) setting and deduce that the optimal feedback control distribution for balancing exploitation and exploration is Gaussian. This in turn interprets and justifies the widely adopted Gaussian exploration in RL, beyond its simplicity for sampling. Moreover, the exploitation and…

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Adaptive Dynamic Programming Control

Methods: Entropy Regularization