A Strongly Asymptotically Optimal Agent in General Environments

Michael K. Cohen; Elliot Catt; Marcus Hutter

arXiv:1903.01021·cs.LG·April 2, 2020


TL;DR

This paper introduces the Inquisitive Reinforcement Learner (Inq), an agent that achieves strong asymptotic optimality in all computable probabilistic environments given a bounded horizon, and compares it experimentally with weakly asymptotically optimal agents in grid-worlds.

Contribution

It presents the first policy that is strongly asymptotically optimal across all computable probabilistic environments, with an inquisitive exploration approach.

Findings

1. Inq approaches optimality with probability 1 in all computable environments.

2. Inq's inquisitiveness enhances exploration efficiency.

3. Experimental results show Inq outperforms other asymptotically optimal agents in grid-worlds.


Equations

$$V^{\pi}_{\nu}(h_{<t}):=\frac{1}{\Gamma_{t}}\mathbb{E}^{\pi}_{\nu}\left[\sum_{k=t}^{\infty}\gamma_{k}r_{k}\Biggm{|}h_{<t}\right]$$

$$\pi^{*}_{\nu}(\cdot):=\operatorname*{argmax}_{\pi\in\Pi}V^{\pi}_{\nu}(\cdot)$$

$$\operatorname*{IG}(h^{\prime}|h_{<t}):=\sum_{\nu\in\mathcal{M}}w(\nu|h_{<t}h^{\prime})\log\frac{w(\nu|h_{<t}h^{\prime})}{w(\nu|h_{<t})}$$

$$\alpha^{m}:\bigcup_{i=0}^{m-1}\mathcal{H}^{i}\to\mathcal{A}$$

$$V^{\operatorname*{IG}}(\alpha^{m},h_{<t}):=\sum_{h_{t:t+m-1}\in\mathcal{H}^{m}}\operatorname{P}^{\alpha^{m}}_{\xi}(h_{<t+m}|h_{<t})\operatorname*{IG}(h_{t:t+m-1}|h_{<t})$$

$$\alpha^{\operatorname*{IG}}_{m,k}(h_{<t}):=\operatorname*{argmax}_{\alpha^{m}:\bigcup_{i=0}^{m-1}\mathcal{H}^{i}\to\mathcal{A}}V^{\operatorname*{IG}}(\alpha^{m},h_{<t-k})$$

$$a^{\operatorname*{IG}}_{m,k}(h_{<t}):=\alpha^{\operatorname*{IG}}_{m,k}(h_{<t})(h_{t-k:t-1})$$

$$\rho(h_{<t},m,k):=\min\left\{\frac{1}{m^{2}(m+1)},\ \eta V^{\operatorname*{IG}}(\alpha^{\operatorname*{IG}}_{m,k}(h_{<t}),h_{<t-k})\right\}$$

$$\beta(h_{<t}):=\sum_{m\in\mathbb{N}}\sum_{k<m,t}\rho(h_{<t},m,k)\leq\sum_{m\in\mathbb{N}}\sum_{k<m,t}\frac{1}{m^{2}(m+1)}\leq\sum_{m\in\mathbb{N}}\sum_{k<m}\frac{1}{m^{2}(m+1)}=1$$

$$V^{*}_{\mu}(h_{<t}):=\sup_{\pi\in\Pi}V^{\pi}_{\mu}(h_{<t})=V^{\pi^{*}_{\mu}}_{\mu}(h_{<t})$$

$$V^{*}_{\mu}(h_{<t})-V^{\pi^{\dagger}}_{\mu}(h_{<t})\to 0\quad\textrm{with }\operatorname{P}^{\pi^{\dagger}}_{\mu}\textrm{-prob. 1}$$

$$\operatorname*{KL}_{h_{<t},n}\left(\operatorname{P}^{\pi}_{\nu_{1}}\Bigm{|}\Bigm{|}\operatorname{P}^{\pi}_{\nu_{2}}\right):=\sum_{h^{\prime}\in\mathcal{H}^{n}}\operatorname{P}^{\pi}_{\nu_{1}}(h^{\prime}|h_{<t})\log\frac{\operatorname{P}^{\pi}_{\nu_{1}}(h^{\prime}|h_{<t})}{\operatorname{P}^{\pi}_{\nu_{2}}(h^{\prime}|h_{<t})}$$

$$V^{\operatorname*{IG}}(\alpha^{m},h_{<t})=\sum_{\nu\in\mathcal{M}}w(\nu|h_{<t})\operatorname*{KL}_{h_{<t},m}\left(\operatorname{P}^{\alpha^{m}}_{\nu}\Bigm{|}\Bigm{|}\operatorname{P}^{\alpha^{m}}_{\xi}\right)$$

$$\rho(h_{<t},m,k)\xrightarrow{t\to\infty}0\quad\textrm{w.p.1}$$

$$\beta(h_{<t})\to 0\quad\textrm{w.p.1}$$

$$\operatorname{P}^{\alpha^{m}}_{\mu}(h_{t:t+m-1}|h_{<t})-\operatorname{P}^{\alpha^{m}}_{\xi}(h_{t:t+m-1}|h_{<t})\xrightarrow{t\to\infty}0\quad\textrm{w.p.1}$$

$$V^{*}_{\mu}(h_{<t})=\frac{1}{\Gamma_{t}}\mathbb{E}^{\pi^{*}_{\mu}}_{\mu}\left[\sum_{k=t}^{\infty}\gamma_{k}r_{k}\biggm{|}h_{<t}\right]$$

$$V^{*\setminus m}_{\mu}(h_{<t}):=\frac{1}{\Gamma_{t}}\mathbb{E}^{\pi^{*}_{\mu}}_{\mu}\left[\sum_{k=t}^{t+m-1}\gamma_{k}r_{k}\biggm{|}h_{<t}\right]$$

$$\left|V^{*}_{\mu}(h_{<t})-V^{*\setminus m}_{\mu}(h_{<t})\right|\leq\frac{\Gamma_{t+m}}{\Gamma_{t}}\leq\varepsilon$$

$$\begin{aligned}
V^{*}_{\mu}(h_{<t})&\leq V^{*\setminus m}_{\mu}(h_{<t})+\varepsilon\\
&=\frac{1}{\Gamma_{t}}\sum_{h_{t:t+m-1}\in\mathcal{H}^{m}}\operatorname{P}^{\pi^{*}_{\mu}}_{\mu}(h_{t:t+m-1}|h_{<t})\sum_{k=t}^{t+m-1}\gamma_{k}r_{k}+\varepsilon\\
&\overset{(a)}{\leq}\frac{1}{\Gamma_{t}}\sum_{h_{t:t+m-1}\in\mathcal{H}^{m}}\operatorname{P}^{\pi^{*}_{\mu}}_{\xi}(h_{t:t+m-1}|h_{<t})\sum_{k=t}^{t+m-1}\gamma_{k}r_{k}+2\varepsilon&&(\exists T_{1}\ \forall t>T_{1})\\
&\overset{(b)}{\leq}\frac{1}{\Gamma_{t}}\mathbb{E}^{\pi^{*}_{\mu}}_{\xi}\left[\sum_{k=t}^{\infty}\gamma_{k}r_{k}\biggm{|}h_{<t}\right]+2\varepsilon\\
&\overset{(c)}{\leq}\frac{1}{\Gamma_{t}}\mathbb{E}^{\pi^{*}_{\xi}}_{\xi}\left[\sum_{k=t}^{\infty}\gamma_{k}r_{k}\biggm{|}h_{<t}\right]+2\varepsilon\\
&\overset{(d)}{\leq}\frac{1}{\Gamma_{t}}\sum_{h_{t:t+m-1}\in\mathcal{H}^{m}}\operatorname{P}^{\pi^{*}_{\xi}}_{\xi}(h_{t:t+m-1}|h_{<t})\sum_{k=t}^{t+m-1}\gamma_{k}r_{k}+3\varepsilon\\
&\overset{(e)}{\leq}\frac{1}{\Gamma_{t}}\sum_{h_{t:t+m-1}\in\mathcal{H}^{m}}\operatorname{P}^{\pi^{*}_{\xi}}_{\mu}(h_{t:t+m-1}|h_{<t})\sum_{k=t}^{t+m-1}\gamma_{k}r_{k}+4\varepsilon&&(\exists T_{2}\ \forall t>T_{2})\\
&\overset{(f)}{\leq}\frac{1}{\Gamma_{t}}\sum_{h_{t:t+m-1}\in\mathcal{H}^{m}}\frac{\operatorname{P}^{\pi^{\dagger}}_{\mu}(h_{t:t+m-1}|h_{<t})}{\prod_{k=t}^{t+m-1}(1-\beta(h_{<k}))}\sum_{k=t}^{t+m-1}\gamma_{k}r_{k}+4\varepsilon&&(\exists T_{3}\ \forall t>T_{3})\\
&\leq\frac{1}{\Gamma_{t}}\sum_{h_{t:t+m-1}\in\mathcal{H}^{m}}\frac{\operatorname{P}^{\pi^{\dagger}}_{\mu}(h_{t:t+m-1}|h_{<t})}{(1-\max_{t\leq k<t+m}\beta(h_{<k}))^{m}}\sum_{k=t}^{t+m-1}\gamma_{k}r_{k}+4\varepsilon\\
&\overset{(g)}{\leq}\frac{1}{\Gamma_{t}}\sum_{h_{t:t+m-1}\in\mathcal{H}^{m}}\frac{\operatorname{P}^{\pi^{\dagger}}_{\mu}(h_{t:t+m-1}|h_{<t})}{(1-\varepsilon^{\prime})^{m}}\sum_{k=t}^{t+m-1}\gamma_{k}r_{k}+4\varepsilon&&(\exists T_{4},\ \varepsilon^{\prime}>0\ \forall t>T_{4})\\
&\overset{(h)}{\leq}\frac{1}{(1-\varepsilon^{\prime})^{m}\Gamma_{t}}\mathbb{E}^{\pi^{\dagger}}_{\mu}\left[\sum_{k=t}^{\infty}\gamma_{k}r_{k}\biggm{|}h_{<t}\right]+4\varepsilon\\
&=\frac{1}{(1-\varepsilon^{\prime})^{m}}V^{\pi^{\dagger}}_{\mu}(h_{<t})+4\varepsilon\\
&=V^{\pi^{\dagger}}_{\mu}(h_{<t})+4\varepsilon+\left(\frac{1}{(1-\varepsilon^{\prime})^{m}}-1\right)V^{\pi^{\dagger}}_{\mu}(h_{<t})\\
&\overset{(i)}{\leq}V^{\pi^{\dagger}}_{\mu}(h_{<t})+4\varepsilon+\left(\frac{1}{(1-\varepsilon^{\prime})^{m}}-1\right)
\end{aligned}$$

$$\forall\delta>0\ \exists T\ \forall t>T:\ V^{*}_{\mu}(h_{<t})-V^{\pi^{\dagger}}_{\mu}(h_{<t})<\delta\quad\textrm{w.p.1}$$

$$V^{*}_{\mu}(h_{<t})-V^{\pi^{\dagger}}_{\mu}(h_{<t})\to 0\quad\textrm{w.p.1}$$


Full text

A Strongly Asymptotically Optimal Agent in General Environments

Michael K. Cohen (contact author), Elliot Catt, Marcus Hutter

Australian National University

{michael.cohen, elliot.carpentercatt, marcus.hutter}@anu.edu.au

Abstract

Reinforcement Learning agents are expected to eventually perform well. Typically, this takes the form of a guarantee about the asymptotic behavior of an algorithm given some assumptions about the environment. We present an algorithm for a policy whose value approaches the optimal value with probability 1 in all computable probabilistic environments, provided the agent has a bounded horizon. This is known as strong asymptotic optimality, and it was previously unknown whether it was possible for a policy to be strongly asymptotically optimal in the class of all computable probabilistic environments. Our agent, Inquisitive Reinforcement Learner (Inq), is more likely to explore the more it expects an exploratory action to reduce its uncertainty about which environment it is in, hence the term inquisitive. Exploring inquisitively is a strategy that can be applied generally; for more manageable environment classes, inquisitiveness is tractable. We conducted experiments in “grid-worlds” to compare the Inquisitive Reinforcement Learner to other weakly asymptotically optimal agents.

1 Introduction

“Efforts to solve [an instance of the exploration-exploitation problem] so sapped the energies and minds of Allied analysts that the suggestion was made that the problem be dropped over Germany, as the ultimate instrument of intellectual sabotage.” –Peter Whittle [1979]

The Allied analysts were considering the simplest possible problem in which there is a trade-off to be made between exploiting, taking the apparently best option, and exploring, choosing a different option to learn more. We tackle what we consider the most difficult instance of the exploration-exploitation trade-off problem: when the environment could be any computable probability distribution, not just a multi-armed bandit, how can one achieve optimal performance in the limit?

Our work is within the Reinforcement Learning (RL) paradigm: an agent selects an action, and the environment responds with an observation and a reward. The interaction may end, or it may continue forever. Each interaction cycle is called a timestep. The agent has a discount function that weights its relative concern for the reward it achieves at various future timesteps. The agent’s job is to select actions that maximize the total expected discounted reward it achieves in its lifetime. The “value” of an agent’s policy at a certain point in time is the expected total discounted reward it achieves after that time if it follows that policy. One formal specification of the exploration-exploitation problem is: what policy can an agent follow so that the policy’s value approaches the value of the optimal informed policy with probability 1, even when the agent doesn’t start out knowing the true dynamics of its environment?

Most work in RL makes strong assumptions about the environment, for instance that it is Markov. Impressive recent developments in reinforcement learning often rely on the Markov assumption, including Deep Q Networks [Mnih et al., 2015], A3C [Mnih et al., 2016], Rainbow [Hessel et al., 2018], and AlphaZero [Silver et al., 2017]. Another example of making strong assumptions in RL comes from some model-based algorithms that implicitly assume that the environment is representable by, for example, a fixed-size neural network, or whatever construct is used to model the environment. We do not make any such assumptions.

Many recent developments in RL are largely about tractably learning to exploit; how to explore intelligently is a separate problem. We address the latter problem. Our approach, inquisitiveness, is based on the Knowledge Seeking Agent for Stochastic Environments of Orseau et al. [2013], which selects the actions that best inform the agent about what environment it is in. Our Inquisitive Reinforcement Learner (Inq) explores like a knowledge seeking agent, and is more likely to explore when there is apparently (according to its current beliefs) more to be learned. Sometimes exploring well requires “expeditions,” or many consecutive exploratory actions. Inq entertains expeditions of all lengths, although it follows the longer ones less often, and it doesn’t resolutely commit in advance to seeing the expedition through.

This is a very human approach to information acquisition. When we spot an opportunity to learn something about our natural environment, we feel inquisitive. We get distracted. We are inclined to check it out, even if we don’t see directly in advance how this information might help us better achieve our goals. Moreover, if we can tell that the opportunity to learn something requires a longer term project, we may find ourselves less inquisitive.

For the class of computable environments (stochastic environments that follow a computable probability distribution), it was previously unknown whether any policy could achieve strong asymptotic optimality (convergence of the value to optimality with probability 1). Lattimore and Hutter [2011] showed that no deterministic policy could achieve this. The key advantage that stochastic policies have is that they can let the exploration probability go to $0$ while still exploring infinitely often. (For example, an agent that explores with probability $1/t$ at time $t$ still explores infinitely often.)
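The parenthetical example can be checked with a short calculation. As a simplifying assumption of ours (not stated in the source), suppose the exploration decisions $E_{t}$ at different timesteps are independent, with $\Pr(E_{t})=1/t$. Then the second Borel-Cantelli lemma gives

$$\Pr(E_{t})=\frac{1}{t}\to 0\qquad\textrm{while}\qquad\sum_{t=1}^{\infty}\Pr(E_{t})=\sum_{t=1}^{\infty}\frac{1}{t}=\infty\ \Longrightarrow\ \Pr(E_{t}\textrm{ occurs infinitely often})=1.$$

A deterministic policy has no analogue of this: at each timestep it either explores or it does not.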

There is a weaker notion of optimality, “weak asymptotic optimality”, for which positive results already exist; this condition requires that the average value over the agent’s lifetime approach optimality. Lattimore and Hutter [2011] identified a weakly asymptotically optimal agent for deterministic computable environments; the agent maintains a list of environments consistent with its observations, exploiting as if it is in the first such one, and exploring in bursts. A recent algorithm for a Thompson Sampling Bayesian agent was shown, with an elegant proof, to be weakly asymptotically optimal in all computable environments, but not strongly asymptotically optimal [Leike et al., 2016].

Most work in RL concerns (Partially Observable) Markov Decision Processes, or (PO)MDPs. However, environments that enter completely novel states infinitely often render (PO)MDP algorithms helpless. For example, an RL agent acting as a chatbot, optimizing a function, or proving mathematical theorems would struggle to model the environment as an MDP, and would likely require an exploration mechanism like ours. In the chatbot case, for instance, as a conversation with a person progresses, the person never returns to the same state.

If we formally compare Inq to existing algorithms in MDPs, we find that many achieve asymptotic optimality. Epsilon-greedy, upper confidence bound, and Thompson sampling exploration strategies suffice in MDPs. Our primary motivation is for the sorts of environments described above. To discriminate between exploratory approaches in ergodic MDPs, one can formally bound regret, and we would like to do this for Inq in the future.

For comparison, some algorithms which use the MDP formalism also consider information-theoretic approaches to exploration, such as VIME [Houthooft et al., 2016], the agent in Still [2009], and TEXPLORE-VANIR [Hester and Stone, 2012].

In Section 2, we formally describe the RL setup and present notation. In Section 3, we present the algorithm for Inq. In Section 4, we prove our main result: that Inq is strongly asymptotically optimal. In Section 5, we present experimental results comparing Inq to weakly asymptotically optimal agents. Finally, we discuss the relevance of this exploration regime to tractable algorithms. Appendix A collates notation and definitions for quick reference. Appendix B contains the proofs of the lemmas.

2 Notation

We follow the notation of Orseau et al. [2013]. The reinforcement learning setup is as follows: $\mathcal{A}$ is a finite set of actions available to the agent; $\mathcal{O}$ is a finite set of observations it might observe, and $\mathcal{R}=[0,1]\cap\mathbb{Q}$ is the set of possible rewards. The set of all possible interactions in a timestep is $\mathcal{H}:=\mathcal{A}\times\mathcal{O}\times\mathcal{R}$. At every timestep, one element from this set occurs. A reinforcement learner’s policy $\pi$ is a stochastic function which outputs an action given an interaction history, denoted by $\pi:\mathcal{H}^{*}\rightsquigarrow\mathcal{A}$. ($\mathcal{X}^{*}:=\bigcup_{i=0}^{\infty}\mathcal{X}^{i}$ represents all finite strings from an alphabet $\mathcal{X}$.) An environment is a stochastic function which outputs an observation and reward given an interaction history and an action: $\nu:\mathcal{H}^{*}\times\mathcal{A}\rightsquigarrow\mathcal{O}\times\mathcal{R}$. For a stochastic function $f:\mathcal{X}\rightsquigarrow\mathcal{Y}$, $f(y|x)$ denotes the probability that $f$ outputs $y\in\mathcal{Y}$ when $x\in\mathcal{X}$ is input.

A policy and an environment induce a probability measure over $\mathcal{H}^{\infty}$ , the set of all possible infinite histories: for $h\in\mathcal{H}^{*}$ , $\operatorname{P}^{\pi}_{\nu}(h)$ denotes the probability that an infinite history begins with $h$ when actions are sampled from the policy $\pi$ , and observations and rewards are sampled from the environment $\nu$ . Formally, we define this inductively: $\operatorname{P}^{\pi}_{\nu}(\epsilon)\mapsto 1$ , where $\epsilon$ is the empty history, and for $h\in\mathcal{H}^{*}$ , $a\in\mathcal{A}$ , $o\in\mathcal{O}$ , $r\in\mathcal{R}$ , we define $\operatorname{P}^{\pi}_{\nu}(haor)\mapsto\operatorname{P}^{\pi}_{\nu}(h)\pi(a|h)\nu(or|ha)$ . In an infinite history $h_{1:\infty}\in\mathcal{H}^{\infty}$ , $a_{t}$ , $o_{t}$ , and $r_{t}$ refer to the $t$ th action, observation and reward, and $h_{t}$ refers to the $t$ th timestep: $a_{t}o_{t}r_{t}$ . $h_{<t}$ refers to the first $t-1$ timesteps, and $h_{t:k}$ refers to the string of timesteps $t$ through $k$ (inclusive). Strings of actions, observations, and rewards are notated similarly.

A Bayesian agent deems a class of environments a priori feasible. Its “beliefs” take the form of a probability distribution over which environment is the true one. We call this the agent’s belief distribution. In our formulation, Inq considers any computable environment feasible, and starts with a prior belief distribution based on the environments’ Kolmogorov complexities: that is, the length of the shortest program that computes the environment on some reference machine. However, all our results hold as long as the true environment is contained in the class of environments that are considered feasible, and as long as the prior belief distribution assigns nonzero probability to each environment in the class. We take $\mathcal{M}$ to be the class of all computable environments, and $w(\nu):=2^{-K(\nu)(1+\varepsilon)}/\mathcal{N}$ to be the prior probability of the environment $\nu$ , where $K$ is the Kolmogorov complexity, $\varepsilon>0$ , and $\mathcal{N}$ is a normalization constant. ( $\varepsilon>0$ ensures the prior has finite entropy, which facilitates analysis.) A smaller class with a different prior probability could easily be substituted for $\mathcal{M}$ and $w(\nu)$ .

We use $\xi$ to denote the agent’s beliefs about future observations. Together with a policy $\pi$ it defines a Bayesian mixture measure: $\operatorname{P}^{\pi}_{\xi}(\cdot):=\sum_{\nu\in\mathcal{M}}w(\nu)\operatorname{P}^{\pi}_{\nu}(\cdot)$ . The posterior belief distribution of the agent after observing a history $h\in\mathcal{H}^{*}$ is $w(\nu|h):=w(\nu)\operatorname{P}^{\pi^{\prime}}_{\nu}(h)/\operatorname{P}^{\pi^{\prime}}_{\xi}(h)$ . This definition is independent of the choice of $\pi^{\prime}$ as long as $\operatorname{P}^{\pi^{\prime}}_{\xi}(h)>0$ ; we can fix a reference policy $\pi^{\prime}$ just for this definition if we like. We sometimes also refer to the conditional distribution $\xi(or|ha):=\sum_{\nu\in\mathcal{M}}w(\nu|h)\nu(or|ha)$ .
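As a minimal sketch of the posterior update above, the following assumes a finite class of two hypothetical environments (the source uses all computable environments) and exact rational arithmetic. Because the policy's action probabilities appear in both the numerator and the denominator, only each environment's probability of the observed percepts is needed; the names `posterior`, `prior`, and `likelihoods` are ours.

```python
from fractions import Fraction

def posterior(prior, likelihoods):
    """Bayes update: w(nu | h) = w(nu) * P_nu(h) / P_xi(h)."""
    mixture = sum(w * p for w, p in zip(prior, likelihoods))  # P_xi(h)
    if mixture == 0:
        raise ValueError("history has zero probability under the mixture")
    return [w * p / mixture for w, p in zip(prior, likelihoods)]

# Two hypothetical environments: nu_0 emits observation 1 with probability 1/2,
# nu_1 emits it with probability 9/10.
prior = [Fraction(1, 2), Fraction(1, 2)]
# Probability that each environment assigns to seeing observation 1 twice.
likelihoods = [Fraction(1, 2) ** 2, Fraction(9, 10) ** 2]
post = posterior(prior, likelihoods)
```

After two observations of 1, most of the posterior weight has moved to the environment that predicted them better.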

The agent’s discount at a timestep is denoted $\gamma_{t}$ . To normalize the agent’s policy’s value to $[0,1]$ , we introduce $\Gamma_{t}:=\sum_{k=t}^{\infty}\gamma_{k}$ . (Normalization makes value convergence nontrivial). We consider an agent with a bounded horizon: $\forall\varepsilon>0\ \exists m\ \forall t:\Gamma_{t+m}/\Gamma_{t}\leq\varepsilon$ . Intuitively, this means that the agent does not become more and more farsighted over time. Note this does not require a finite horizon. A classic discount function giving a bounded horizon is a geometric one: for $0\leq\gamma<1$ , $\gamma_{t}=\gamma^{t}$ . The value of a policy $\pi$ in an environment $\nu$ , given a history $h_{<t}\in\mathcal{H}^{t-1}$ , is

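$$V^{\pi}_{\nu}(h_{<t}):=\frac{1}{\Gamma_{t}}\mathbb{E}^{\pi}_{\nu}\left[\sum_{k=t}^{\infty}\gamma_{k}r_{k}\Biggm{|}h_{<t}\right]$$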

Here, the expectation is with respect to the probability measure $\operatorname{P}^{\pi}_{\nu}$ . Reinforcement Learning is the attempt to find a policy that makes this value high, without access to $\nu$ .
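For the geometric discount with $0<\gamma<1$, the bounded-horizon condition can be made concrete: $\Gamma_{t}=\gamma^{t}/(1-\gamma)$, so $\Gamma_{t+m}/\Gamma_{t}=\gamma^{m}$ regardless of $t$. The sketch below (the function name `horizon_for` is ours) finds the smallest $m$ that achieves a given $\varepsilon$.

```python
import math

def horizon_for(gamma, eps):
    """Smallest m >= 1 with gamma**m <= eps, for 0 < gamma < 1 and 0 < eps < 1."""
    m = max(1, math.ceil(math.log(eps) / math.log(gamma)))
    # Correct for floating-point rounding in the logarithms, in both directions.
    while gamma ** m > eps:
        m += 1
    while m > 1 and gamma ** (m - 1) <= eps:
        m -= 1
    return m
```

For instance, with $\gamma=0.9$ the agent needs to look 44 steps ahead before the remaining discounted weight drops to 1% of $\Gamma_{t}$.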

3 Inquisitive Reinforcement Learner

We first describe how Inq exploits, then how it explores. It exploits by maximizing the discounted sum of its reward in expectation over its current beliefs, and it explores by following maximally informative “exploratory expeditions” of various lengths.

An optimal policy with respect to an environment $\nu$ is a policy that maximizes the value.

$$\pi^{*}_{\nu}(\cdot):=\operatorname*{argmax}_{\pi\in\Pi}V^{\pi}_{\nu}(\cdot)$$

where $\Pi=\mathcal{H}^{*}\rightsquigarrow\mathcal{A}$ is the space of all policies. An optimal deterministic policy always exists [Lattimore and Hutter, 2014b]. When exploiting, Inq simply maximizes the value according to its belief distribution $\xi$. Since this policy is deterministic, we write $a^{*}(h_{<t})$ to mean the unique action at time $t$ for which $\pi_{\xi}^{*}(a|h_{<t})=1$. That is the exploitative action.

The most interesting feature of Inq is how it gets distracted by the opportunity to explore. Inq explores to learn. An agent has learned from an observation if its belief distribution $w$ changes significantly after making that observation. If the belief distribution has hardly changed, then the observation was not very informative. The typical information-theoretic measure for how well a distribution $Q$ approximates a distribution $P$ is the KL-divergence, $\operatorname*{KL}(P||Q)$. Thus, a principled way to quantify the information that an agent gains in a timestep is the KL-divergence from the belief distribution at time $t+1$ to the belief distribution at time $t$. This is the rationale behind the construction of the Knowledge Seeking Agent of Orseau et al. [2013], which maximizes this expected information gain.

Letting $h_{<t}\in\mathcal{H}^{t-1}$ and $h^{\prime}\in\mathcal{H}^{*}$ , the information gain at time $t$ is defined:

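$$\operatorname*{IG}(h^{\prime}|h_{<t}):=\sum_{\nu\in\mathcal{M}}w(\nu|h_{<t}h^{\prime})\log\frac{w(\nu|h_{<t}h^{\prime})}{w(\nu|h_{<t})}$$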

Recall that $w(\nu|h)$ is the posterior probability assigned to $\nu$ after observing $h$ .
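Information gain is the KL-divergence of the updated weights from the previous weights. A minimal sketch for a finite class follows, using natural logarithms and the convention $0\log 0:=0$; the function and argument names are ours, and the weights are assumed to come from a Bayes update, so a positive updated weight implies a positive previous weight.

```python
import math

def information_gain(prior, posterior):
    """KL(posterior || prior) in nats, skipping terms whose posterior weight is 0."""
    return sum(q * math.log(q / p) for p, q in zip(prior, posterior) if q > 0)
```

An observation that leaves the weights unchanged yields zero information gain, while one that settles a 50/50 question yields $\log 2$ nats.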

An $m$ -step expedition, denoted $\alpha^{m}$ , represents all contingencies for how an agent will act for the next $m$ timesteps. It is a deterministic policy that takes history-fragments of length less than $m$ and returns an action:

$$\alpha^{m}:\bigcup_{i=0}^{m-1}\mathcal{H}^{i}\to\mathcal{A}$$

$\operatorname{P}^{\alpha^{m}}_{\xi}(h_{<t+k}|h_{<t})$ is a conditional distribution defined for $0\leq k\leq m$ , which represents the conditional probability of observing $h_{<t+k}$ if the expedition $\alpha^{m}$ is followed starting at time $t$ , after observing $h_{<t}$ . Now we can consider the information-gain value of an $m$ -step expedition. It is the expected information gain upon following that expedition:

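$$V^{\operatorname*{IG}}(\alpha^{m},h_{<t}):=\sum_{h_{t:t+m-1}\in\mathcal{H}^{m}}\operatorname{P}^{\alpha^{m}}_{\xi}(h_{<t+m}|h_{<t})\operatorname*{IG}(h_{t:t+m-1}|h_{<t})$$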

At a time $t$ , one might consider many expeditions: the one-step expedition which maximizes expected information gain, the two-step expedition doing the same, etc. Or one might consider carrying on with an expedition that began three timesteps ago.

Definition 1.

At time $t$, the $m$-$k$ expedition is the $m$-step expedition beginning at time $t-k$ which maximized the expected information gain from that point. (Ties in the argmax are broken arbitrarily.)

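$$\alpha^{\operatorname*{IG}}_{m,k}(h_{<t}):=\operatorname*{argmax}_{\alpha^{m}:\bigcup_{i=0}^{m-1}\mathcal{H}^{i}\to\mathcal{A}}V^{\operatorname*{IG}}(\alpha^{m},h_{<t-k})$$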

Example expeditions are diagrammed in Figure 1.

Expeditions are functions which return an action given what has been seen so far on the expedition. The $m$ - $k$ exploratory action is the action to take at time $t$ according to the $m$ - $k$ expedition:

$$a^{\operatorname*{IG}}_{m,k}(h_{<t}):=\alpha^{\operatorname*{IG}}_{m,k}(h_{<t})(h_{t-k:t-1})$$

Naturally, this is only defined for $k<m,t$ , since the expedition function can’t accept a history fragment of length $\geq m$ , and $t-k$ must be positive. Note also that if $k=0$ , $h_{t-k:t-1}$ evaluates to the empty string, $\epsilon$ .

The reason Inq doesn’t ignore expeditions that started in the past is that Inq must have some chance of actually executing the whole expedition (for every expedition). If the probability of completing an expedition is 0, one cannot use it for a bound on Inq’s belief-accuracy.

Definition 2.

Let $\rho(h_{<t},m,k)$ be the probability of taking the $m$ - $k$ exploratory action after observing a history $h_{<t}$ .

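$$\rho(h_{<t},m,k):=\min\left\{\frac{1}{m^{2}(m+1)},\ \eta V^{\operatorname*{IG}}(\alpha^{\operatorname*{IG}}_{m,k}(h_{<t}),h_{<t-k})\right\}$$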

where $\eta$ is an exploration constant.

Note in the definition of $\rho(h_{<t},m,k)$ that the probability of following an expedition goes to $0$ if the expected information gain from that expedition goes to $0$. The first term in the $\min$ ensures the probabilities will not sum to more than 1. The total probability of exploration is defined:

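$$\beta(h_{<t}):=\sum_{m\in\mathbb{N}}\sum_{k<m,t}\rho(h_{<t},m,k)\leq\sum_{m\in\mathbb{N}}\sum_{k<m,t}\frac{1}{m^{2}(m+1)}\leq\sum_{m\in\mathbb{N}}\sum_{k<m}\frac{1}{m^{2}(m+1)}=1$$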

The feature that makes Inq inquisitive is that $\rho(h_{<t},m,k)$ is proportional to the expected information gain from the $m$ - $k$ expedition, $V^{\operatorname*{IG}}(\alpha^{\operatorname*{IG}}_{m,k}(h_{<t}),h_{<t-k})$ . Note that completing an $m$ -step expedition requires randomly deciding to explore in that way on $m$ separate occasions. While this may seem inefficient, if the agent always got boxed into long expeditions, the value of its policy would plummet infinitely often.
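The cap $1/(m^{2}(m+1))$ works because each $m$ contributes at most $m$ values of $k$, so the caps sum to $\sum_{m}1/(m(m+1))$, which telescopes to 1. The sketch below illustrates the exploration probability and that bound numerically; the function names and the truncation at `max_m` are ours, and the information-gain value is passed in rather than computed.

```python
def rho(m, info_gain_value, eta):
    """Exploration probability: min{1 / (m^2 (m + 1)), eta * V_IG}."""
    return min(1.0 / (m * m * (m + 1)), eta * info_gain_value)

def cap_mass(max_m):
    """Sum of the caps over all m <= max_m and all k < m (ignoring the k < t limit)."""
    return sum(m * (1.0 / (m * m * (m + 1))) for m in range(1, max_m + 1))
```

Truncated at `max_m`, the total cap mass is $1-1/(\texttt{max\_m}+1)$, so it approaches but never exceeds 1. When the expected information gain is small, the second term of the minimum is the active one, and the exploration probability scales linearly with it.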

Finally, Inq’s policy $\pi^{\dagger}$, defined in Algorithm 1, takes the $m$-$k$ exploratory action with probability $\rho(\cdot,m,k)$, and takes the exploitative action otherwise. (Algorithm 1 is written in a simplified way that does not halt, but if a real number in $[0,1]$ is sampled first, the actions can be assigned to disjoint intervals successively until the sampled real number lands in one of them.)
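The interval-assignment idea can be sketched as follows. This is an illustration under our own assumptions, not the authors' Algorithm 1: the exploratory actions and their probabilities are supplied as a finite list (in effect truncating at some maximum $m$), and the names `select_action`, `exploit_action`, and `exploratory` are hypothetical.

```python
import random

def select_action(exploit_action, exploratory, u=None):
    """Pick one action for the current timestep.

    exploit_action: the exploitative action a*(h_<t).
    exploratory: finite list of (action, rho) pairs, one per considered (m, k).
    u: a number in [0, 1); sampled uniformly if omitted.
    """
    if u is None:
        u = random.random()
    upper = 0.0
    for action, rho in exploratory:
        upper += rho  # this pair owns the interval [upper - rho, upper)
        if u < upper:
            return action
    return exploit_action  # u fell outside every exploratory interval
```

Each exploratory action is chosen with probability equal to the length of its interval, and the exploitative action receives the remaining mass $1-\beta$.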

4 Strong Asymptotic Optimality

Here we present our central result: that the value of $\pi^{\dagger}$ approaches the optimal value. We present the theorem, motivate the result, and proceed to the proof. We recommend the reader have Appendix A at hand for quickly looking up definitions and notation.

Before presenting the theorem, we clarify an assumption and define the optimal value. We call the true environment $\mu$, and we assume that $\mu\in\mathcal{M}$. For $\mathcal{M}$ the class of computable environments, this is a very mild assumption. The optimal value is simply the value of the optimal policy with respect to the true environment:

$$V^{*}_{\mu}(h_{<t}):=\sup_{\pi\in\Pi}V^{\pi}_{\mu}(h_{<t})=V^{\pi^{*}_{\mu}}_{\mu}(h_{<t})$$

Recall also that we have assumed the agent has a bounded horizon in the sense that $\forall\varepsilon>0\ \exists m\ \forall t:\Gamma_{t+m}/\Gamma_{t}\leq\varepsilon$. The Strong Asymptotic Optimality theorem is that under these conditions, the value of Inq’s policy approaches the optimal value with probability 1, when actions are sampled from Inq’s policy and observations and rewards are sampled from the true environment $\mu$.

Theorem 3 (Strong Asymptotic Optimality).

As $t\to\infty$ ,

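$$V^{*}_{\mu}(h_{<t})-V^{\pi^{\dagger}}_{\mu}(h_{<t})\to 0\quad\textrm{with }\operatorname{P}^{\pi^{\dagger}}_{\mu}\textrm{-prob. 1}$$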

where $\mu\in\mathcal{M}$ is the true environment.

For a Bayesian agent, uncertainty about on-policy observations goes to $0$. Since “on-policy” for Inq includes, with some probability, all maximally informative expeditions, Inq eventually has little uncertainty about the result of any course of action, and can therefore successfully select the optimal course. For any fixed horizon, Inq’s mixture measure $\xi$ approaches the true environment $\mu$.

We use the following notation for a particular KL-divergence that plays a central role in the proof:

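$$\operatorname*{KL}_{h_{<t},n}\left(\operatorname{P}^{\pi}_{\nu_{1}}\Bigm{|}\Bigm{|}\operatorname{P}^{\pi}_{\nu_{2}}\right):=\sum_{h^{\prime}\in\mathcal{H}^{n}}\operatorname{P}^{\pi}_{\nu_{1}}(h^{\prime}|h_{<t})\log\frac{\operatorname{P}^{\pi}_{\nu_{1}}(h^{\prime}|h_{<t})}{\operatorname{P}^{\pi}_{\nu_{2}}(h^{\prime}|h_{<t})}$$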

This quantifies the difference between the expected observations of two different environments that would arise in the next $n$ timesteps when following policy $\pi$ . $\operatorname*{KL}_{h_{<t},\infty}$ denotes the limit of the above as $n\to\infty$ , which exists by [Orseau et al., 2013, proof of Theorem 3].

In dealing with the KL-divergence, we simplify matters by asserting that $0\log 0:=0$ , and $0\log\frac{0}{0}:=0$ .

We begin with a lemma that equates the information gain value of an expedition with the expected prediction error. The KL-divergence on the right hand side represents how different $\nu$ and $\xi$ appear when following the expedition in question.

Lemma 4.

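$$V^{\operatorname*{IG}}(\alpha^{m},h_{<t})=\sum_{\nu\in\mathcal{M}}w(\nu|h_{<t})\operatorname*{KL}_{h_{<t},m}\left(\operatorname{P}^{\alpha^{m}}_{\nu}\Bigm{|}\Bigm{|}\operatorname{P}^{\alpha^{m}}_{\xi}\right)$$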

Proofs of Lemmas appear in Appendix B.
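Lemma 4 can be sanity-checked numerically in a toy case. The sketch below assumes a one-step expedition (a single fixed action), three hypothetical environments that each give a distribution over two outcomes, and an arbitrary prior; all names and numbers are ours. It computes the expected information gain under the mixture and the prior-weighted KL-divergence from each environment to the mixture.

```python
import math

def mixture(prior, envs):
    """xi(o) = sum_nu w(nu) * nu(o) for each outcome index o."""
    n_outcomes = len(envs[0])
    return [sum(w * env[o] for w, env in zip(prior, envs)) for o in range(n_outcomes)]

def expected_information_gain(prior, envs):
    """Left side: expectation under xi of KL(posterior || prior)."""
    xi = mixture(prior, envs)
    total = 0.0
    for o, xi_o in enumerate(xi):
        if xi_o == 0:
            continue  # outcome cannot occur under the mixture
        for w, env in zip(prior, envs):
            post = w * env[o] / xi_o  # posterior weight w(nu | o)
            if post > 0:
                total += xi_o * post * math.log(post / w)
    return total

def weighted_kl(prior, envs):
    """Right side: sum_nu w(nu) * KL(nu || xi)."""
    xi = mixture(prior, envs)
    total = 0.0
    for w, env in zip(prior, envs):
        for o, p in enumerate(env):
            if w > 0 and p > 0:
                total += w * p * math.log(p / xi[o])
    return total

prior = [0.5, 0.3, 0.2]
envs = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
lhs = expected_information_gain(prior, envs)
rhs = weighted_kl(prior, envs)
```

The two quantities agree up to floating-point error, which is the one-step, finite-class instance of the identity in the lemma.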

Recall that $w(\nu|h_{<t})$ is the posterior weight that Inq assigns to the environment $\nu$ after observing $h_{<t}$ . We show that the infimum of this value is strictly positive with probability 1.

Lemma 5.

$\inf_{t}w(\mu|h_{<t})>0$ with $\operatorname{P}^{\pi}_{\mu}$-probability 1.

Next, we show that every exploration probability $\rho(h_{<t},m,k)$ goes to $0$. From here, all “w.p.1” statements mean with $\operatorname{P}^{\pi^{\dagger}}_{\mu}$-probability 1, if not otherwise specified.

Lemma 6.

$$\rho(h_{<t},m,k)\xrightarrow{t\to\infty}0\quad\textrm{w.p.1}$$

The essence of the proof is that with a finite-entropy prior, there is only a finite amount of information to gain, so the expected information gain (and the exploration probability) goes to 0.

Next, we show that the total exploration probability goes to 0:

Lemma 7.

$$\beta(h_{<t})\to 0\quad\textrm{w.p.1}$$

Lemma 8 shows that the probabilities assigned by $\xi$ converge to those of $\mu$ .

Lemma 8.

$\forall m\in\mathbb{N}$ , $h_{t:t+m-1}\in\mathcal{H}^{m}$ , $\alpha^{m}:\ \bigcup_{i=0}^{m-1}\mathcal{H}^{i}\to\mathcal{A}$ :

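$$\operatorname{P}^{\alpha^{m}}_{\mu}(h_{t:t+m-1}|h_{<t})-\operatorname{P}^{\alpha^{m}}_{\xi}(h_{t:t+m-1}|h_{<t})\xrightarrow{t\to\infty}0\quad\textrm{w.p.1}$$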

The proof of Lemma 8 runs roughly as follows: if all exploration probabilities go to $0$, then the informativeness of the maximally informative expeditions goes to 0, so the informativeness of all expeditions goes to 0, meaning the prediction error goes to 0.

Finally, we prove the Strong Asymptotic Optimality Theorem: $V^{*}_{\mu}(h_{<t})-V^{\pi^{\dagger}}_{\mu}(h_{<t})\to 0$ with $\operatorname{P}^{\pi^{\dagger}}_{\mu}$-probability 1.

Proof of Theorem 3.

Let $\varepsilon>0$ . Since the agent has a bounded horizon, there exists an $m$ such that for all $t$ , $\frac{\Gamma_{t+m}}{\Gamma_{t}}\leq\varepsilon$ . Recall

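$$V^{*}_{\mu}(h_{<t})=\frac{1}{\Gamma_{t}}\mathbb{E}^{\pi^{*}_{\mu}}_{\mu}\left[\sum_{k=t}^{\infty}\gamma_{k}r_{k}\biggm{|}h_{<t}\right]$$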

Using the $m$ from above, let

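$$V^{*\setminus m}_{\mu}(h_{<t}):=\frac{1}{\Gamma_{t}}\mathbb{E}^{\pi^{*}_{\mu}}_{\mu}\left[\sum_{k=t}^{t+m-1}\gamma_{k}r_{k}\biggm{|}h_{<t}\right]$$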

Since $r_{t}\in[0,1]$ ,

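$$\left|V^{*}_{\mu}(h_{<t})-V^{*\setminus m}_{\mu}(h_{<t})\right|\leq\frac{\Gamma_{t+m}}{\Gamma_{t}}\leq\varepsilon$$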

We continue from there:

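$$\begin{aligned}
V^{*}_{\mu}(h_{<t})&\leq V^{*\setminus m}_{\mu}(h_{<t})+\varepsilon\\
&=\frac{1}{\Gamma_{t}}\sum_{h_{t:t+m-1}\in\mathcal{H}^{m}}\operatorname{P}^{\pi^{*}_{\mu}}_{\mu}(h_{t:t+m-1}|h_{<t})\sum_{k=t}^{t+m-1}\gamma_{k}r_{k}+\varepsilon\\
&\overset{(a)}{\leq}\frac{1}{\Gamma_{t}}\sum_{h_{t:t+m-1}\in\mathcal{H}^{m}}\operatorname{P}^{\pi^{*}_{\mu}}_{\xi}(h_{t:t+m-1}|h_{<t})\sum_{k=t}^{t+m-1}\gamma_{k}r_{k}+2\varepsilon&&(\exists T_{1}\ \forall t>T_{1})\\
&\overset{(b)}{\leq}\frac{1}{\Gamma_{t}}\mathbb{E}^{\pi^{*}_{\mu}}_{\xi}\left[\sum_{k=t}^{\infty}\gamma_{k}r_{k}\biggm{|}h_{<t}\right]+2\varepsilon\\
&\overset{(c)}{\leq}\frac{1}{\Gamma_{t}}\mathbb{E}^{\pi^{*}_{\xi}}_{\xi}\left[\sum_{k=t}^{\infty}\gamma_{k}r_{k}\biggm{|}h_{<t}\right]+2\varepsilon\\
&\overset{(d)}{\leq}\frac{1}{\Gamma_{t}}\sum_{h_{t:t+m-1}\in\mathcal{H}^{m}}\operatorname{P}^{\pi^{*}_{\xi}}_{\xi}(h_{t:t+m-1}|h_{<t})\sum_{k=t}^{t+m-1}\gamma_{k}r_{k}+3\varepsilon\\
&\overset{(e)}{\leq}\frac{1}{\Gamma_{t}}\sum_{h_{t:t+m-1}\in\mathcal{H}^{m}}\operatorname{P}^{\pi^{*}_{\xi}}_{\mu}(h_{t:t+m-1}|h_{<t})\sum_{k=t}^{t+m-1}\gamma_{k}r_{k}+4\varepsilon&&(\exists T_{2}\ \forall t>T_{2})\\
&\overset{(f)}{\leq}\frac{1}{\Gamma_{t}}\sum_{h_{t:t+m-1}\in\mathcal{H}^{m}}\frac{\operatorname{P}^{\pi^{\dagger}}_{\mu}(h_{t:t+m-1}|h_{<t})}{\prod_{k=t}^{t+m-1}(1-\beta(h_{<k}))}\sum_{k=t}^{t+m-1}\gamma_{k}r_{k}+4\varepsilon&&(\exists T_{3}\ \forall t>T_{3})\\
&\leq\frac{1}{\Gamma_{t}}\sum_{h_{t:t+m-1}\in\mathcal{H}^{m}}\frac{\operatorname{P}^{\pi^{\dagger}}_{\mu}(h_{t:t+m-1}|h_{<t})}{(1-\max_{t\leq k<t+m}\beta(h_{<k}))^{m}}\sum_{k=t}^{t+m-1}\gamma_{k}r_{k}+4\varepsilon\\
&\overset{(g)}{\leq}\frac{1}{\Gamma_{t}}\sum_{h_{t:t+m-1}\in\mathcal{H}^{m}}\frac{\operatorname{P}^{\pi^{\dagger}}_{\mu}(h_{t:t+m-1}|h_{<t})}{(1-\varepsilon^{\prime})^{m}}\sum_{k=t}^{t+m-1}\gamma_{k}r_{k}+4\varepsilon&&(\exists T_{4},\ \varepsilon^{\prime}>0\ \forall t>T_{4})\\
&\overset{(h)}{\leq}\frac{1}{(1-\varepsilon^{\prime})^{m}\Gamma_{t}}\mathbb{E}^{\pi^{\dagger}}_{\mu}\left[\sum_{k=t}^{\infty}\gamma_{k}r_{k}\biggm{|}h_{<t}\right]+4\varepsilon\\
&=\frac{1}{(1-\varepsilon^{\prime})^{m}}V^{\pi^{\dagger}}_{\mu}(h_{<t})+4\varepsilon\\
&=V^{\pi^{\dagger}}_{\mu}(h_{<t})+4\varepsilon+\left(\frac{1}{(1-\varepsilon^{\prime})^{m}}-1\right)V^{\pi^{\dagger}}_{\mu}(h_{<t})\\
&\overset{(i)}{\leq}V^{\pi^{\dagger}}_{\mu}(h_{<t})+4\varepsilon+\left(\frac{1}{(1-\varepsilon^{\prime})^{m}}-1\right)
\end{aligned}$$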

$(a)$, $(e)$, $(f)$, and $(g)$ all hold with probability 1. $(a)$ follows from Lemma 8: for all $m$, $\operatorname{P}^{\pi}_{\xi}(\cdot|h_{<t})\to\operatorname{P}^{\pi}_{\mu}(\cdot|h_{<t})$ for all conditional probabilities of histories of length $m$, with probability 1, and the countable sum is bounded (by $\Gamma_{t}$). $(b)$ follows from adding more non-negative terms to the sum. $(c)$ follows from $\pi^{*}_{\xi}$ being the $\xi$-optimal policy, which therefore accrues at least as much expected reward in environment $\xi$ as $\pi^{*}_{\mu}$ does. $(d)$ follows from $\sum_{k=t+m}^{\infty}\gamma_{k}/\Gamma_{t}=\Gamma_{t+m}/\Gamma_{t}\leq\varepsilon$, and $r_{t}\in[0,1]$. $(e)$ follows from Lemma 8 just as $(a)$ did. $(f)$ follows because the product in the denominator is the probability that $\pi^{\dagger}$ mimics $\pi^{*}_{\xi}$ for $m$ consecutive timesteps, and by Lemma 7 there is a time after which this probability is uniformly strictly positive. $(g)$ follows from Lemma 7: $\beta(h_{<k})\to 0$ with probability 1. $(h)$ follows from adding more non-negative terms to the sum. Finally, $(i)$ follows from the value being normalized to $[0,1]$ by $\Gamma_{t}$.

$\forall\delta>0\ \exists\varepsilon>0,\varepsilon^{\prime}>0:4\varepsilon+(\frac{1}{(1-\varepsilon^{\prime})^{m}}-1)<\delta$ . Letting $T=\max\{T_{1},T_{2},T_{3},T_{4}\}$ , we can combine the equations above to give

[TABLE]

Since $V^{*}_{\mu}(h_{<t})\geq V^{\pi^{\dagger}}_{\mu}(h_{<t})$ ,

[TABLE]

∎
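The final step of the proof needs $\varepsilon$ and $\varepsilon^{\prime}$ with $4\varepsilon+(\frac{1}{(1-\varepsilon^{\prime})^{m}}-1)<\delta$ , where $m$ itself depends on $\varepsilon$ . The sketch below shows one way to pick them in order: first $\varepsilon$ , then $m$ , then $\varepsilon^{\prime}$ . It assumes geometric discounting so that $m$ can be computed in closed form, and the split of $\delta$ into $\delta/2+\delta/4$ is an arbitrary illustrative choice, not the paper's. The names `choose_tolerances` and `bound` are hypothetical.

```python
import math


def choose_tolerances(delta, gamma):
    """Pick (eps, m, eps_prime) so that the bound below is < delta.

    Assumes geometric discounting with factor gamma, so that
    Gamma_{t+m} / Gamma_t = gamma**m.
    """
    if not (0.0 < delta < 1.0 and 0.0 < gamma < 1.0):
        raise ValueError("need 0 < delta < 1 and 0 < gamma < 1")
    eps = delta / 8.0  # so that 4 * eps = delta / 2
    m = math.ceil(math.log(eps) / math.log(gamma))
    while gamma ** m > eps:  # rounding guard
        m += 1
    # Solve 1 / (1 - eps_prime)**m - 1 = delta / 4 for eps_prime.
    eps_prime = 1.0 - (1.0 + delta / 4.0) ** (-1.0 / m)
    return eps, m, eps_prime


def bound(eps, m, eps_prime):
    """The quantity 4*eps + (1/(1 - eps_prime)**m - 1) from the proof."""
    return 4.0 * eps + (1.0 / (1.0 - eps_prime) ** m - 1.0)


eps, m, eps_prime = choose_tolerances(0.1, 0.99)
print(eps, m, eps_prime, bound(eps, m, eps_prime))
```

The order matters: $\varepsilon^{\prime}$ must shrink as $m$ grows, which is why it is chosen last.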

Strong Asymptotic Optimality is not a guarantee of efficacy; consider an agent that “commits suicide” on the first timestep, and thereafter receives a reward of 0 no matter what it does. This agent is asymptotically optimal, but not very useful. In general, when considering many environments with many different “traps,” bounded regret is impossible to guarantee Hutter [2005], but one can still demand from a reinforcement learner that it make the best of whatever situation it finds itself in by correctly identifying (in the limit) the optimal policy.

We suspect that strong asymptotic optimality would not hold if Inq had an unbounded horizon, since its horizon of concern may grow faster than it can learn about progressively more long-term dynamics of the environment. Going more into the technical details, let $\Delta_{kt}$ be, roughly “at time $t$ , how much does $\xi$ differ from $\mu$ regarding predictions about the next $k$ timesteps?” A lemma in our proof is that $\forall k\ \lim_{t\to\infty}\Delta_{kt}=0$ , but this does not imply, for example, that $\lim_{z\to\infty}\Delta_{zz}=0$ . If the horizon which is necessary to predict is growing over time, Inq might not be strongly asymptotically optimal.
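The gap between $\forall k\ \lim_{t\to\infty}\Delta_{kt}=0$ and $\lim_{z\to\infty}\Delta_{zz}=0$ can be seen with a toy sequence. The function below is purely hypothetical and is not the paper's $\Delta_{kt}$ ; it only shares the relevant limiting behavior: for any fixed lookahead $k$ it vanishes as $t$ grows, yet along the diagonal it stays at 1.

```python
def toy_delta(k, t):
    """Hypothetical stand-in for Delta_{kt}: min(1, k / t)."""
    return min(1.0, k / t)


times = (10, 100, 1000, 10000)

# Fixed lookahead k = 5: the gap shrinks toward 0 as t grows.
fixed_k = [toy_delta(5, t) for t in times]

# Lookahead growing with time (k = t): the gap never shrinks.
diagonal = [toy_delta(z, z) for z in times]

print(fixed_k)
print(diagonal)
```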

Indeed, we tenuously suspect that it is impossible for an agent with an unbounded time horizon to be strongly asymptotically optimal in the class of all computable environments. If that is true, then the assumptions that our result relies on (namely that the true environment is computable, and the agent has a bounded horizon) are the bare minimum for strong asymptotic optimality to be possible.

Inq is not computable; in fact, no computable policy can be strongly asymptotically optimal in the class of all computable environments (Lattimore and Hutter [2011] show this for deterministic policies, but a simple modification extends this to stochastic policies). For many smaller environment classes, however, Inq would be computable, for example if $\mathcal{M}$ is finite, and perhaps for decidable $\mathcal{M}$ in general. The central result, that inquisitiveness is an effective exploration strategy, applies to any Bayesian agent.

5 Experimental Results

We compared Inq with other known weakly asymptotically optimal agents, Thompson sampling and BayesExp Lattimore and Hutter [2014a], in the grid-world environment using AIXIjs Aslanides [2017], which has previously been used to compare asymptotically optimal agents Aslanides et al. [2017]. We tested in $10\times 10$ and $20\times 20$ grid-worlds, both with a single dispenser with probability of dispensing reward $0.75$ ; that is, if the agent enters that cell, the probability of a reward of 1 is 0.75. Following the conventions of Aslanides et al. [2017], we averaged over 50 simulations, used discount factor $\gamma=0.99$ , 600 MCTS samples, and a planning horizon of 6. The planning horizon restricts $m$ , and the number of MCTS samples is an input to $\rho$ UCT Silver and Veness [2010], which we use instead of expectimax. The algorithm for the approximate version of Inq is in Appendix C. The code used for this experiment is available online at https://github.com/ejcatt/aixijs, and this version of Inq can be run in the browser at https://ejcatt.github.io/aixijs/demo.html#inq. We found that using small values for $\eta$ , specifically $\eta\leq 1$ , worked well. For our experiments we chose $\eta=1$ .

In the $10\times 10$ grid-worlds, Inq performed comparably to both BayesExp and Thompson sampling. However, in the $20\times 20$ grid-worlds, Inq performed comparably to BayesExp and outperformed Thompson sampling. This is likely because when the Thompson sampling agent samples an environment with a reward dispenser that is inaccessible within its planning horizon, the agent acts randomly rather than seeking new cells. This is in contrast to Inq and BayesExp, which always have an incentive to explore the frontier of cells that have not been visited. This is especially relevant in the larger grid, where the Thompson sampling agent is more likely to act as if the dispenser is deep in uncharted territory, rather than nearby. In a grid-world, good exploration is just about visiting new states, which both Inq and BayesExp successfully seek.

6 Conclusion

We have shown that it is possible for an agent with a bounded horizon to be strongly asymptotically optimal in the class of all computable environments. No existing RL agent has as strong an optimality guarantee as Inq. The nature of the exploration regime that accomplishes this is perhaps of wider interest. We formalize an agent that gets distracted from reward maximization by its inquisitiveness: the more it expects to learn from an expedition, the more inclined it is to take it.

We have confirmed experimentally that inquisitiveness is a practical and effective exploration strategy for Bayesian agents with manageable model classes.

There are two main avenues for future work we would like to see. The first regards possible extensions of inquisitiveness: we have defined inquisitiveness for Bayesian agents with countable model-classes, but inquisitiveness could also be defined for a Bayesian agent with a continuous model class, such as a Q-learner using a Bayesian Neural Network. The second avenue regards the theory of strong asymptotic optimality itself: is Inq strongly asymptotically optimal for more farsighted discounters? If not, can it be modified to accomplish that? Or is it indeed impossible for an agent with an unbounded horizon to be strongly asymptotically optimal in the class of computable environments? Answers to these questions, besides being interesting in their own right, will likely inform the design of tractable exploration strategies, in the same way that this work has done.

Acknowledgements

This work was supported by the Open Philanthropy Project AI Scholarship and the Australian Research Council Discovery Projects DP150104590.

Appendix A Definitions and Notation – Quick Reference

[TABLE]

Appendix B Proofs of Lemmas

We begin with a lemma that equates the information gain value of an expedition with the expected prediction error. The KL-divergence on the right hand side represents how different $\nu$ and $\xi$ appear when following the expedition in question.

See Lemma 4.

Proof.

This result is shown in [Orseau et al., 2013, Equation 4]. ∎
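The identity behind this lemma can be checked numerically in a stripped-down setting. The sketch below assumes a finite model class, a single observation step, and no actions, and it takes the information gain to be the expected KL-divergence from the prior to the posterior; these are simplifications for illustration, not the paper's $m$ -step expedition setting. All names and numbers are hypothetical.

```python
import math


def mixture(prior, likelihoods):
    """xi(x) = sum_nu w(nu) * nu(x)."""
    n_obs = len(likelihoods[0])
    return [sum(w * lik[x] for w, lik in zip(prior, likelihoods))
            for x in range(n_obs)]


def posterior(prior, likelihoods, x):
    """Bayes' rule: w(nu | x) = w(nu) * nu(x) / xi(x)."""
    xi_x = mixture(prior, likelihoods)[x]
    return [w * lik[x] / xi_x for w, lik in zip(prior, likelihoods)]


def kl(p, q):
    """KL(p || q) in nats, skipping zero-probability outcomes of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def expected_information_gain(prior, likelihoods):
    """E_{x ~ xi}[ KL(w(. | x) || w(.)) ]."""
    xi = mixture(prior, likelihoods)
    return sum(xi[x] * kl(posterior(prior, likelihoods, x), prior)
               for x in range(len(xi)) if xi[x] > 0)


def expected_prediction_error(prior, likelihoods):
    """sum_nu w(nu) * KL(nu || xi)."""
    xi = mixture(prior, likelihoods)
    return sum(w * kl(lik, xi) for w, lik in zip(prior, likelihoods))


prior = [0.5, 0.3, 0.2]                               # w(nu)
likelihoods = [[0.75, 0.25], [0.5, 0.5], [0.1, 0.9]]  # nu(x) for x in {0, 1}
ig = expected_information_gain(prior, likelihoods)
err = expected_prediction_error(prior, likelihoods)
print(ig, err)
```

The two quantities agree because $\xi(x)\,w(\nu|x)=w(\nu)\,\nu(x)$ and $w(\nu|x)/w(\nu)=\nu(x)/\xi(x)$ , so the two double sums are the same sum written in a different order.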

Recall that $w(\nu|h_{<t})$ is the posterior weight that Inq assigns to the environment $\nu$ after observing $h_{<t}$ . We show that the infimum of this value is strictly positive with probability 1.

See Lemma 5.

Proof.

Suppose $\inf_{t}w(\mu|h_{<t})=0$ . $w(\mu|h_{<t})>0$ for all histories generated by $\operatorname{P}^{\pi}_{\mu}$ . Therefore, $\inf_{t}w(\mu|h_{<t})=0\implies\liminf_{t\to\infty}w(\mu|h_{<t})=0$ , and $\limsup_{t\to\infty}w(\mu|h_{<t})^{-1}=\infty$ . We show that this has probability 0.

Let

$z_{t}:=w(\mu|h_{<t})^{-1}$

We first show that $z_{t}$ is a $\mu$ -martingale.

[TABLE]

By the martingale convergence theorem $z_{t}\to f(\omega)<\infty\ \ \mathrm{w.p.1}$ , for $\omega\in\Omega$ , the sample space, and some $f:\Omega\to\mathbb{R}$ . Therefore, $\inf_{t}w(\mu|h_{<t})>0\ \ \mathrm{w.p.1}$ . ∎
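The martingale property can be verified numerically in a small case. The sketch below assumes $z_{t}=w(\mu|h_{<t})^{-1}$ , as the surrounding argument suggests, and uses a hypothetical class of three Bernoulli environments with no actions; it checks that the one-step $\mu$ -expectation of $z_{t+1}$ equals $z_{t}$ after an arbitrary history.

```python
def update(weights, thetas, x):
    """Posterior over Bernoulli parameters after observing bit x."""
    liks = [th if x == 1 else 1.0 - th for th in thetas]
    norm = sum(w * l for w, l in zip(weights, liks))
    return [w * l / norm for w, l in zip(weights, liks)]


thetas = [0.75, 0.5, 0.2]   # hypothetical model class
weights = [0.2, 0.5, 0.3]   # prior weights w(nu)
mu = 0                      # index of the environment playing the role of mu

# Condition on an arbitrary history.
for x in (1, 0, 0, 1, 0):
    weights = update(weights, thetas, x)

z_now = 1.0 / weights[mu]

# E_mu[z_{t+1} | h_{<t}]: average the next z over mu's own predictions.
p_one = thetas[mu]
z_next_expected = (p_one / update(weights, thetas, 1)[mu]
                   + (1.0 - p_one) / update(weights, thetas, 0)[mu])

print(z_now, z_next_expected)
```

The equality holds because $\mu(x|h)/w(\mu|hx)=\xi(x|h)/w(\mu|h)$ , and the $\xi(x|h)$ terms sum to 1.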

Next we show that every exploration probability $\rho(h_{<t},m,k)$ goes to 0. From here, all “w.p.1” statements mean with $\operatorname{P}^{\pi^{\dagger}}_{\mu}$ -probability 1, if not otherwise specified.

See Lemma 6.

Proof.

$\rho(h_{<t},m,k)=\rho(h_{<t-k},m,0)$ , so we need only show that $\rho(h_{<t},m,0)\to 0$ w.p.1. We do this by showing that the expectation of $\rho(h_{<t},m,0)^{m+1}$ is summable. (This is a stronger result, since it implies that it is summable with probability 1, so the probability that it is greater than $\varepsilon$ infinitely often is 0.) A bit of notational background: $0\in\mathbb{N}$ , and $m\mathbb{N}+i=\{i,i+m,i+2m,...\}$ . Each equation and inequality is explained below.

[TABLE]

For multiple steps in this derivation, note that the information gain is non-negative; this is a property of the KL-divergence. (a) follows from the l.h.s. being one of the non-negative summands of the r.h.s. (b) follows from the definition of $\xi$ . (c) follows from the definition of $\rho$ . (d) substitutes $V^{\operatorname*{IG}}$ for its definition. (e) expands the definition of the expectation. (f) follows because $\pi^{\dagger}$ mimics $\alpha^{\operatorname*{IG}}_{m,0}$ for $m$ consecutive timesteps with probability $\prod_{i=0}^{m-1}\rho(h_{<t+i},m,i)=\rho(h_{<t},m,0)^{m}$ , so the probability of any history under $\operatorname{P}^{\pi^{\dagger}}_{\xi}$ is at least the probability of that history under $\operatorname{P}^{\alpha^{\operatorname*{IG}}_{m,0}}_{\xi}$ times $\rho(h_{<t},m,0)^{m}$ . (g) combines the two expectations, which are now with respect to the same probability measure. (h) expands the definition of the information gain. (i) rearranges the expectations and the sums, and expands $w(\nu|h_{<t+m})$ according to Bayes’ rule. (j) converts the expectation to an expectation with respect to a different probability measure through simple cancellation. (k) implements a change of variable from $t$ to $mk+i$ . (l) moves a sum inside the logarithm. (m) cancels out all terms except the numerator of the last term and the denominator of the first. (n) follows from all posterior weights being $\leq 1$ . (o) and (p) are obvious. (q) applies the definition of the entropy of a distribution $\mathrm{Ent}(\cdot)$ , and expands the expectation. (r) changes the variable in the expectation; this is the reverse of (i) and (j). (s) applies the definition of the information gain (after inverting the fraction in the logarithm). (t) follows from the non-negativity of the information gain. And (u) is shown in [Orseau et al., 2013, Proposition 13].

Finally,

[TABLE]

∎

Now, we show that the total exploration probability goes to 0: see Lemma 7.

Proof.

[TABLE]

Each of the terms in the sum approaches 0 with probability 1 by Lemma 6, and because $\rho(h_{<t},m,k)=\rho(h_{<t-k},m,0)$ . Suppose by contradiction $\beta(h_{<t})>\varepsilon>0$ infinitely often. There exists an $M$ such that

[TABLE]

for all $t$ . With that $M$ , if $\beta(h_{<t})>\varepsilon$ infinitely often, it must be the case that $\sum_{m=0}^{M-1}\sum_{k=0}^{m-1}\rho(h_{<t},m,k)>\varepsilon/2$ infinitely often, but this is a finite sum of terms that all approach 0, a contradiction. ∎

Lemma 8 shows that the probabilities assigned by $\xi$ converge to those of $\mu$ .

See Lemma 8.

Proof.

Suppose that $0<\varepsilon\leq(\operatorname{P}^{\alpha^{m}}_{\mu}(h_{t:t+m-1}|h_{<t})-\operatorname{P}^{\alpha^{m}}_{\xi}(h_{t:t+m-1}|h_{<t}))^{2}$ for some $h_{t:t+m-1}$ .

[TABLE]

(a) is a result from information theory known as the entropy inequality. (b) follows from the non-negativity of the KL-divergence, and the l.h.s. being one of the summands of the r.h.s. (c) follows from Lemma 4. (d) follows from the definition of the infimum. And (e) follows from the fact that $\alpha^{\operatorname*{IG}}_{m,0}(h_{<t})$ maximizes $V^{\operatorname*{IG}}(\cdot,h_{<t})$ , by definition.
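For readers unfamiliar with step (a), one standard form of the entropy inequality states that for probability distributions $p$ and $q$ on a finite set, $\sum_{i}(p_{i}-q_{i})^{2}\leq\sum_{i}p_{i}\ln(p_{i}/q_{i})$ . The sketch below spot-checks this form on random distributions; it is assumed, not confirmed, that this is the exact form used in the derivation above, and all names are hypothetical.

```python
import math
import random


def kl(p, q):
    """KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def squared_distance(p, q):
    """Sum of squared differences between two distributions."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def random_distribution(rng, n):
    """A random strictly positive distribution on n outcomes."""
    raw = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]


rng = random.Random(0)  # fixed seed for reproducibility
violations = 0
for _ in range(10000):
    n = rng.randint(2, 6)
    p = random_distribution(rng, n)
    q = random_distribution(rng, n)
    # Small tolerance absorbs floating-point rounding.
    if squared_distance(p, q) > kl(p, q) + 1e-12:
        violations += 1

print(violations)
```

A random search is only a sanity check, not a proof; the inequality itself follows from Pinsker's inequality together with $\max_{i}|p_{i}-q_{i}|\leq\frac{1}{2}\sum_{i}|p_{i}-q_{i}|$ .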

Therefore,

[TABLE]

This has probability 0 by Lemmas 6 and 5. Thus, with probability 1, $\operatorname{P}^{\alpha^{m}}_{\mu}(h_{t:t+m-1}|h_{<t})-\operatorname{P}^{\alpha^{m}}_{\xi}(h_{t:t+m-1}|h_{<t})\to 0$ .

∎

Appendix C Approximation of Inq

Following Aslanides [2017], our approximation of Inq calls $\rho$ UCT Silver and Veness [2010] as a subroutine in place of expectimax.

Bibliography

Aslanides et al. [2017] John Aslanides, Jan Leike, and Marcus Hutter. Universal reinforcement learning algorithms: survey and experiments. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 1403–1410. AAAI Press, 2017.
Aslanides [2017] John Aslanides. AIXIjs: A software demo for general reinforcement learning. arXiv preprint arXiv:1705.07615, 2017.
Hessel et al. [2018] Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Proc. of AAAI Conference on Artificial Intelligence, 2018.
Hester and Stone [2012] Todd Hester and Peter Stone. Intrinsically motivated model learning for a developing curious agent. In 2012 IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL), pages 1–6. IEEE, 2012.
Houthooft et al. [2016] Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Curiosity-driven exploration in deep reinforcement learning via Bayesian neural networks. arXiv preprint arXiv:1605.09674, 2016.
Hutter [2005] Marcus Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2005.
Lattimore and Hutter [2011] Tor Lattimore and Marcus Hutter. Asymptotically optimal agents. In Proc. 22nd International Conf. on Algorithmic Learning Theory (ALT’11), volume 6925 of LNAI, pages 368–382, Espoo, Finland, 2011. Springer.
Lattimore and Hutter [2014a] Tor Lattimore and Marcus Hutter. Bayesian reinforcement learning with exploration. In International Conference on Algorithmic Learning Theory, pages 170–184. Springer, 2014.
