Reinforcement Learning When All Actions are Not Always Available

Yash Chandak; Georgios Theocharous; Blossom Metevier; Philip S. Thomas

arXiv:1906.01772·cs.LG·January 22, 2020

TL;DR

This paper introduces new policy gradient algorithms for stochastic action set MDPs, addressing divergence issues and demonstrating their effectiveness on real-world inspired tasks.

Contribution

It proposes variance-reduced policy gradient methods tailored for SAS-MDPs, with convergence guarantees and practical validation.

Findings

01 Algorithms improve stability in SAS-MDPs

02 Demonstrated convergence under certain conditions

03 Effective on real-life inspired decision tasks

Abstract

The Markov decision process (MDP) formulation used to model many real-world sequential decision making problems does not efficiently capture the setting where the set of available decisions (actions) at each time step is stochastic. Recently, the stochastic action set Markov decision process (SAS-MDP) formulation has been proposed, which better captures the concept of a stochastic action set. In this paper we argue that existing RL algorithms for SAS-MDPs can suffer from potential divergence issues, and present new policy gradient algorithms for SAS-MDPs that incorporate variance reduction techniques unique to this setting, and provide conditions for their convergence. We conclude with experiments that demonstrate the practicality of our approaches on tasks inspired by real-life use cases wherein the action set is stochastic.

Equations

$$\mathcal{T}^{\pi}v(s)=\sum_{\alpha\in 2^{\mathcal{B}}}\varphi(s,\alpha)\sum_{a\in\alpha}\pi(s,\alpha,a)\sum_{s^{\prime}\in\mathcal{S}}\mathcal{P}(s,a,s^{\prime})\Big(\mathcal{R}(s,a)+\gamma v(s^{\prime})\Big)$$

$$\mathcal{T}^{*}v(s)=\sum_{\alpha\in 2^{\mathcal{B}}}\varphi(s,\alpha)\max_{a\in\alpha}\sum_{s^{\prime}\in\mathcal{S}}\mathcal{P}(s,a,s^{\prime})\Big(\mathcal{R}(s,a)+\gamma v(s^{\prime})\Big)$$

$$q(S_{t},A_{t})\leftarrow(1-\eta)\,q(S_{t},A_{t})+\eta\Big(R_{t}+\gamma\max_{a\in\mathcal{A}_{t+1}}q(S_{t+1},a)\Big).$$

$$\nabla J(\theta)=\sum_{t=0}^{\infty}\sum_{s\in\mathcal{S}}\gamma^{t}\Pr(S_{t}=s\mid\theta)\Big(\sum_{\alpha\in 2^{\mathcal{B}}}\varphi(s,\alpha)\sum_{a\in\alpha}q^{\theta}(s,a)\frac{\partial\pi^{\theta}(s,\alpha,a)}{\partial\theta}\Big).$$

$$\lVert\nabla J(\theta)-\nabla J(\bar{\theta})\rVert\leq L\lVert\theta-\bar{\theta}\rVert\quad\forall\,\theta,\bar{\theta}\in\Theta.$$

$$\sum_{t=0}^{\infty}\eta_{\theta}^{t}=\infty,\qquad\sum_{t=0}^{\infty}(\eta_{\theta}^{t})^{2}<\infty.$$

$$F_{\theta}=\sum_{t=0}^{\infty}\sum_{s\in\mathcal{S}}\gamma^{t}\Pr(S_{t}=s\mid\theta)\Big(\sum_{\alpha\in 2^{\mathcal{B}}}\varphi(s,\alpha)\sum_{a\in\alpha}\pi^{\theta}(s,\alpha,a)\psi^{\theta}(s,\alpha,a)\psi^{\theta}(s,\alpha,a)^{\top}\Big).$$

$$\frac{\partial}{\partial w}\mathbb{E}\Big[\frac{1}{2}\sum_{t}^{\infty}\gamma^{t}\big(\psi^{\theta}(S_{t},\mathcal{A}_{t},A_{t})^{\top}w-q^{\theta}(S_{t},A_{t})\big)^{2}\Big]=0,$$

$$\nabla J(\theta)=\mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^{t}\psi^{\theta}(s,\alpha,a)\big(q^{\theta}(s,a)-b(s)\big)\Big].$$

$$\nabla J(\theta)$$

$$\mathbf{A}=-\big(\mathbb{E}[\mathbf{B}^{\top}\mathbf{B}]\big)^{-1}\mathbb{E}[\mathbf{B}^{\top}\mathbf{C}].$$

$$\frac{d}{d\theta}J(\theta)=\sum_{t=0}^{\infty}\sum_{s\in\mathcal{S}}\gamma^{t}\Pr(S_{t}=s\mid\theta)\sum_{\alpha\in 2^{\mathcal{B}}}\varphi(s,\alpha)\sum_{a\in\alpha}q^{\theta}(s,a)\frac{\partial\pi^{\theta}(s,\alpha,a)}{\partial\theta}.$$

$$\begin{aligned}
\frac{\partial v^{\theta}(s)}{\partial\theta}
&=\frac{\partial}{\partial\theta}\sum_{\alpha\in 2^{\mathcal{B}}}\varphi(s,\alpha)\sum_{a\in\alpha}\pi^{\theta}(s,\alpha,a)q^{\theta}(s,a)\\
&=\sum_{\alpha\in 2^{\mathcal{B}}}\varphi(s,\alpha)\sum_{a\in\alpha}\Big(\frac{\partial\pi^{\theta}(s,\alpha,a)}{\partial\theta}q^{\theta}(s,a)+\pi^{\theta}(s,\alpha,a)\frac{\partial q^{\theta}(s,a)}{\partial\theta}\Big)\\
&=\sum_{\alpha\in 2^{\mathcal{B}}}\varphi(s,\alpha)\sum_{a\in\alpha}\frac{\partial\pi^{\theta}(s,\alpha,a)}{\partial\theta}q^{\theta}(s,a)\\
&\quad+\sum_{\alpha\in 2^{\mathcal{B}}}\varphi(s,\alpha)\sum_{a\in\alpha}\pi^{\theta}(s,\alpha,a)\frac{\partial}{\partial\theta}\sum_{s^{\prime}\in\mathcal{S}}\mathcal{P}(s,a,s^{\prime})\big(\mathcal{R}(s,a)+\gamma v^{\theta}(s^{\prime})\big)\\
&=\sum_{\alpha\in 2^{\mathcal{B}}}\varphi(s,\alpha)\sum_{a\in\alpha}\frac{\partial\pi^{\theta}(s,\alpha,a)}{\partial\theta}q^{\theta}(s,a)+\gamma\sum_{s^{\prime}\in\mathcal{S}}\Pr(S_{t+1}=s^{\prime}\mid S_{t}=s,\theta)\frac{\partial v^{\theta}(s^{\prime})}{\partial\theta}.
\end{aligned}$$

$$\begin{aligned}
\frac{\partial v^{\theta}(s)}{\partial\theta}
&=\sum_{\alpha\in 2^{\mathcal{B}}}\varphi(s,\alpha)\sum_{a\in\alpha}\frac{\partial\pi^{\theta}(s,\alpha,a)}{\partial\theta}q^{\theta}(s,a)\\
&\quad+\gamma\sum_{s^{\prime}\in\mathcal{S}}\Pr(S_{t+1}=s^{\prime}\mid S_{t}=s,\theta)\frac{\partial}{\partial\theta}\Big(\sum_{\alpha^{\prime}\in 2^{\mathcal{B}}}\varphi(s^{\prime},\alpha^{\prime})\sum_{a^{\prime}\in\alpha^{\prime}}\pi^{\theta}(s^{\prime},\alpha^{\prime},a^{\prime})q^{\theta}(s^{\prime},a^{\prime})\Big)\\
&=\sum_{\alpha\in 2^{\mathcal{B}}}\varphi(s,\alpha)\sum_{a\in\alpha}\frac{\partial\pi^{\theta}(s,\alpha,a)}{\partial\theta}q^{\theta}(s,a)\\
&\quad+\gamma\sum_{s^{\prime}\in\mathcal{S}}\Pr(S_{t+1}=s^{\prime}\mid S_{t}=s,\theta)\sum_{\alpha^{\prime}\in 2^{\mathcal{B}}}\varphi(s^{\prime},\alpha^{\prime})\sum_{a^{\prime}\in\alpha^{\prime}}\Big(\frac{\partial\pi^{\theta}(s^{\prime},\alpha^{\prime},a^{\prime})}{\partial\theta}q^{\theta}(s^{\prime},a^{\prime})+\pi^{\theta}(s^{\prime},\alpha^{\prime},a^{\prime})\frac{\partial q^{\theta}(s^{\prime},a^{\prime})}{\partial\theta}\Big)\\
&=\sum_{\alpha\in 2^{\mathcal{B}}}\varphi(s,\alpha)\sum_{a\in\alpha}\frac{\partial\pi^{\theta}(s,\alpha,a)}{\partial\theta}q^{\theta}(s,a)\\
&\quad+\gamma\sum_{s^{\prime}\in\mathcal{S}}\Pr(S_{t+1}=s^{\prime}\mid S_{t}=s,\theta)\sum_{\alpha^{\prime}\in 2^{\mathcal{B}}}\varphi(s^{\prime},\alpha^{\prime})\sum_{a^{\prime}\in\alpha^{\prime}}\frac{\partial\pi^{\theta}(s^{\prime},\alpha^{\prime},a^{\prime})}{\partial\theta}q^{\theta}(s^{\prime},a^{\prime})\\
&\quad+\gamma\sum_{s^{\prime}\in\mathcal{S}}\Pr(S_{t+1}=s^{\prime}\mid S_{t}=s,\theta)\sum_{\alpha^{\prime}\in 2^{\mathcal{B}}}\varphi(s^{\prime},\alpha^{\prime})\sum_{a^{\prime}\in\alpha^{\prime}}\pi^{\theta}(s^{\prime},\alpha^{\prime},a^{\prime})\frac{\partial}{\partial\theta}\Big(\sum_{s^{\prime\prime}\in\mathcal{S}}\mathcal{P}(s^{\prime},a^{\prime},s^{\prime\prime})\big(\mathcal{R}(s^{\prime},a^{\prime})+\gamma v^{\theta}(s^{\prime\prime})\big)\Big)\\
&=\sum_{\alpha\in 2^{\mathcal{B}}}\varphi(s,\alpha)\sum_{a\in\alpha}\frac{\partial\pi^{\theta}(s,\alpha,a)}{\partial\theta}q^{\theta}(s,a)\\
&\quad+\gamma\sum_{s^{\prime}\in\mathcal{S}}\Pr(S_{t+1}=s^{\prime}\mid S_{t}=s,\theta)\sum_{\alpha^{\prime}\in 2^{\mathcal{B}}}\varphi(s^{\prime},\alpha^{\prime})\sum_{a^{\prime}\in\alpha^{\prime}}\frac{\partial\pi^{\theta}(s^{\prime},\alpha^{\prime},a^{\prime})}{\partial\theta}q^{\theta}(s^{\prime},a^{\prime})\\
&\quad+\gamma\sum_{s^{\prime}\in\mathcal{S}}\Pr(S_{t+1}=s^{\prime}\mid S_{t}=s,\theta)\sum_{\alpha^{\prime}\in 2^{\mathcal{B}}}\varphi(s^{\prime},\alpha^{\prime})\sum_{a^{\prime}\in\alpha^{\prime}}\pi^{\theta}(s^{\prime},\alpha^{\prime},a^{\prime})\sum_{s^{\prime\prime}\in\mathcal{S}}\mathcal{P}(s^{\prime},a^{\prime},s^{\prime\prime})\,\gamma\frac{\partial v^{\theta}(s^{\prime\prime})}{\partial\theta}\\
&=\underbrace{\sum_{\alpha\in 2^{\mathcal{B}}}\varphi(s,\alpha)\sum_{a\in\alpha}\frac{\partial\pi^{\theta}(s,\alpha,a)}{\partial\theta}q^{\theta}(s,a)}_{\text{first term}}\\
&\quad+\underbrace{\gamma\sum_{s^{\prime}\in\mathcal{S}}\Pr(S_{t+1}=s^{\prime}\mid S_{t}=s,\theta)\sum_{\alpha^{\prime}\in 2^{\mathcal{B}}}\varphi(s^{\prime},\alpha^{\prime})\sum_{a^{\prime}\in\alpha^{\prime}}\frac{\partial\pi^{\theta}(s^{\prime},\alpha^{\prime},a^{\prime})}{\partial\theta}q^{\theta}(s^{\prime},a^{\prime})}_{\text{second term}}\\
&\quad+\gamma^{2}\sum_{s^{\prime\prime}\in\mathcal{S}}\Pr(S_{t+2}=s^{\prime\prime}\mid S_{t}=s,\theta)\frac{\partial v^{\theta}(s^{\prime\prime})}{\partial\theta}.
\end{aligned}$$

$$F_{\theta}=\sum_{t=0}^{\infty}\sum_{s\in\mathcal{S}}\gamma^{t}\Pr(S_{t}=s\mid\theta)\sum_{\alpha\in 2^{\mathcal{B}}}\varphi(s,\alpha)\sum_{a\in\alpha}\pi^{\theta}(s,\alpha,a)\psi(s,\alpha,a)\psi(s,\alpha,a)^{\top}.$$

$$\mathbb{E}\Big[\frac{\partial^{2}\log\Pr(X)}{\partial\theta^{2}}\Big]=-\mathbb{E}\Big[\frac{\partial\log\Pr(X)}{\partial\theta}\frac{\partial\log\Pr(X)}{\partial\theta}^{\top}\Big].$$

$$\Pr(T_{\theta}=\tau)$$

Code & Models

Repositories

yashchandak/SAS_RL
Full text

Reinforcement Learning When All Actions are Not Always Available

Yash Chandak¹, Georgios Theocharous², Blossom Metevier¹, Philip S. Thomas¹

¹University of Massachusetts Amherst, ²Adobe Research

{ychandak,bmetevier,pthomas}@cs.umass.edu [email protected]

Introduction

In many real-world sequential decision making problems, the set of available decisions, which we call the action set, is stochastic. In vehicular routing on a road network (?) or packet routing on the internet (?), the goal is to find the shortest path between a source and destination. However, due to construction, traffic, or other damage to the network, not all pathways are always available. In online advertising (?; ?), the set of available ads can vary due to fluctuations in advertising budgets and promotions. In robotics (?), actuators can fail. In recommender systems (?), the set of possible recommendations can vary based on product availability. These examples capture the broad idea and motivate the question we aim to address: how can we develop efficient learning algorithms for sequential decision making problems wherein the action set can be stochastic?

Sequential decision making problems without stochastic action sets are typically modeled as Markov decision processes (MDPs). Although the MDP formulation is remarkably flexible, and can incorporate concepts like stochastic state transitions, partial observability, and even different (deterministic) action availability depending on the state, it cannot efficiently incorporate stochastic action sets. As a result, algorithms designed for MDPs are not well suited to our setting of interest. Recently, ? (?) laid the foundations for stochastic action set Markov decision processes (SAS-MDPs), which extend MDPs to include stochastic action sets. They also showed how the Q-learning and value iteration algorithms, two classic algorithms for approximating optimal solutions to MDPs, can be extended to SAS-MDPs.

In this paper we show that the lack of convergence guarantees for the Q-learning algorithm when using function approximators in the MDP setting can be exacerbated in the SAS-MDP setting. We therefore derive policy gradient and natural policy gradient algorithms for the SAS-MDP setting and provide conditions for their almost-sure convergence. Critically, since stochastic action sets introduce further uncertainty into the decision making process, variance reduction techniques are of increased importance. We therefore derive new approaches to variance reduction for policy gradient algorithms that are unique to the SAS-MDP setting. We validate our new algorithms empirically on tasks inspired by real-world problems with stochastic action sets.

Related Work

While there is extensive literature on solving sequential decision problems modeled as MDPs (?), there are few methods designed to handle stochastic action sets. Recently, ? (?) laid the foundation for studying MDPs with stochastic action sets by defining the new SAS-MDP problem formulation, which we review in the background section. After defining SAS-MDPs, ? (?) presented and analyzed the model-based value iteration and policy iteration algorithms and the model-free Q-learning algorithm for SAS-MDPs.

In the bandit setting, wherein individual decisions are optimized rather than sequences of dependent decisions, sleeping bandits extend the standard bandit problem formulation to allow for stochastic action sets (?; ?). We focus on the SAS-MDP formulation rather than the sleeping bandit formulation because we are interested in sequential problems. Such sequential problems are more challenging because making optimal decisions requires one to reason about the long-term impact of decisions, which includes reasoning about how a decision will influence the probability that different actions (decisions) will be available in the future.

Although we focus on the model-free setting, wherein the dynamics of the environment are not known a priori to the agent optimizing its decisions, in the alternative model-based setting researchers have considered related problems in the area of stochastic routing (?; ?; ?; ?). In stochastic routing problems, the goal is to find a shortest path on a graph with stochastic availability of edges. The SAS-MDP framework generalizes stochastic routing problems by allowing for sequential decision making problems that are not limited to shortest path problems.

Background

MDPs and SAS-MDPs (?) are mathematical formulations of sequential decision problems. Before defining SAS-MDPs, we define MDPs. We refer to the entity interacting with an MDP or SAS-MDP and trying to optimize its decisions as the agent.

Formally, an MDP is a tuple $\mathcal{M}=(\mathcal{S},\mathcal{B},\mathcal{P},\mathcal{R},\gamma,d_{0})$. $\mathcal{S}$ is the set of all possible states that the agent can be in, called the state set. Although our math notation assumes that $\mathcal{S}$ is countable, our primary results extend to MDPs with continuous states. $\mathcal{B}$ is a finite set of all possible actions that the agent can take, called the base action set. $S_{t}$ and $A_{t}$ are random variables that denote the state of the environment and action chosen by the agent at time $t\in\{0,1,\dotsc\}$. $\mathcal{P}$ is called the transition function and characterizes how states transition: $\mathcal{P}(s,a,s^{\prime})\coloneqq\Pr(S_{t+1}=s^{\prime}|S_{t}=s,A_{t}=a)$. $R_{t}\in[-R_{\text{max}},R_{\text{max}}]$, a bounded random variable, is the scalar reward received by the agent at time $t$, where $R_{\text{max}}$ is a finite constant. $\mathcal{R}$ is called the reward function, and is defined as $\mathcal{R}(s,a)\coloneqq\mathbb{E}[R_{t}|S_{t}=s,A_{t}=a]$. The reward discount parameter, $\gamma\in[0,1)$, characterizes how the utility of rewards to the agent decays based on how far in the future they occur. We call $d_{0}$ the start state distribution, which is defined as $d_{0}(s)\coloneqq\Pr(S_{0}=s)$.

We now turn to defining a SAS-MDP. Let the set of actions available at time $t$ be a random variable, $\mathcal{A}_{t}\subseteq\mathcal{B}$, which we assume is never empty, i.e., $\mathcal{A}_{t}\neq\emptyset$. Let $\varphi$ characterize the conditional distribution of $\mathcal{A}_{t}$: $\varphi(s,\alpha)\coloneqq\Pr(\mathcal{A}_{t}=\alpha|S_{t}=s)$. We assume that $\mathcal{A}_{t}$ is Markovian, in that its distribution is conditionally independent of all events prior to the agent entering state $S_{t}$ given $S_{t}$. Formally, a SAS-MDP is $\mathcal{M}^{\prime}=\{\mathcal{M}\cup\varphi\}$, with the additional requirement that $A_{t}\in\mathcal{A}_{t}$.

A policy $\pi:\mathcal{S}\times 2^{\mathcal{B}}\times\mathcal{B}\to[0,1]$ is a conditional distribution over actions for each state: $\pi(s,\alpha,a)\coloneqq\Pr(A_{t}=a|S_{t}=s,\mathcal{A}_{t}=\alpha)$ for all $s\in\mathcal{S},a\in\alpha,\alpha\subseteq\mathcal{B}$ , and $t$ , where $\alpha\neq\emptyset$ . Sometimes a policy is parameterized by a weight vector $\theta$ , such that changing $\theta$ changes the policy. We write $\pi^{\theta}$ to denote such a parameterized policy with weight vector $\theta$ . For any policy $\pi$ , we define the corresponding state-action value function to be $q^{\pi}(s,a)\coloneqq\mathbb{E}[\sum_{k=0}^{\infty}\gamma^{k}R_{t+k}|S_{t}=s,A_{t}=a,\pi]$ , where conditioning on $\pi$ denotes that $A_{t+k}\sim\pi(S_{t+k},\mathcal{A}_{t+k},\cdot)$ for all $\mathcal{A}_{t+k}$ and $S_{t+k}$ for $k\in[t+1,\infty)$ . Similarly, the state-value function associated with policy $\pi$ is $v^{\pi}(s)\coloneqq\mathbb{E}[\sum_{k=0}^{\infty}\gamma^{k}R_{t+k}|S_{t}=s,\pi]$ . For a given SAS-MDP $\mathcal{M}^{\prime}$ , the agent’s goal is to find an optimal policy, $\pi^{*}$ , (or equivalently optimal policy parameters $\theta^{*}$ ) which is any policy that maximizes the expected sum of discounted future rewards. More formally, an optimal policy is any $\pi^{*}\in\text{argmax}_{\pi\in\Pi}J(\pi)$ , where $J(\pi)\coloneqq\mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}R_{t}|\pi]$ and $\Pi$ denotes the set of all possible policies. For notational convenience, we sometimes use $\theta$ in place of $\pi$ , e.g., to write $v^{\theta}$ , $q^{\theta}$ , or $J(\theta)$ , since a weight vector $\theta$ induces a specific policy.
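To make the signature $\pi(s,\alpha,a)$ concrete, the following is a minimal sketch of one common way to restrict a policy to the available action set: a softmax over linear scores that is normalized only over the actions in $\alpha$. The function name `sas_policy`, the linear scoring, and the toy features are illustrative assumptions, not necessarily the parameterization the authors use in their appendix.

```python
import math


def sas_policy(theta, features, available):
    """Return {a: pi(s, alpha, a)} over the available actions only.

    theta     -- list of policy weights (hypothetical linear parameterization)
    features  -- dict: base action -> feature vector for the current state s
    available -- non-empty collection of available actions (alpha)
    """
    if not available:
        raise ValueError("alpha must be non-empty")
    # Linear score per available action; unavailable actions receive no mass.
    scores = {a: sum(w * x for w, x in zip(theta, features[a])) for a in available}
    top = max(scores.values())  # shift by the max for numerical stability
    weights = {a: math.exp(z - top) for a, z in scores.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


# Toy usage: three base actions, of which only two are currently available.
features = {"left": [1.0, 0.0], "right": [0.0, 1.0], "stay": [1.0, 1.0]}
probs = sas_policy([0.0, 0.0], features, {"left", "right"})
```

Because the normalization runs over $\alpha$ only, the same weight vector $\theta$ yields a valid distribution for every non-empty action set without embedding $\alpha$ in the state.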

As shown by ? (?), one way to model stochastic action sets using the MDP formulation (rather than the SAS-MDP formulation) is to define states such that one can infer $\mathcal{A}_{t}$ from $S_{t}$. Transforming an MDP into a new MDP with $\mathcal{A}_{t}$ embedded in $S_{t}$ in this way can result in the size of the state set growing exponentially, by a factor of $2^{|\mathcal{B}|}$. This drastic increase in the size of the state set can make finding or approximating an optimal policy prohibitively difficult. Using the SAS-MDP formulation, the challenges associated with this exponential increase in the size of the state set can be avoided, and one can derive algorithms for finding or approximating optimal policies in terms of the state set of the original underlying MDP. This is accomplished using a variant of the Bellman operator, $\mathcal{T}$, which incorporates the concept of stochastic action sets:

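$$\mathcal{T}^{\pi}v(s)=\sum_{\alpha\in 2^{\mathcal{B}}}\varphi(s,\alpha)\sum_{a\in\alpha}\pi(s,\alpha,a)\sum_{s^{\prime}\in\mathcal{S}}\mathcal{P}(s,a,s^{\prime})\Big(\mathcal{R}(s,a)+\gamma v(s^{\prime})\Big)$$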

for all $s\in\mathcal{S}$ . Similarly, one can extend the Bellman optimality operator (?):

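$$\mathcal{T}^{*}v(s)=\sum_{\alpha\in 2^{\mathcal{B}}}\varphi(s,\alpha)\max_{a\in\alpha}\sum_{s^{\prime}\in\mathcal{S}}\mathcal{P}(s,a,s^{\prime})\Big(\mathcal{R}(s,a)+\gamma v(s^{\prime})\Big)$$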
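The two SAS Bellman operators average over action sets drawn from $\varphi$ before either averaging over the policy or maximizing over the available actions. A minimal tabular sketch is given below, assuming dictionary-based models; the function names and the one-state toy problem are hypothetical and only serve to illustrate the order of the sums.

```python
def backup(v, s, a, P, R, gamma):
    """One-step lookahead: sum_{s'} P(s, a, s') * (R(s, a) + gamma * v(s'))."""
    return sum(p * (R[(s, a)] + gamma * v[s2]) for s2, p in P[(s, a)].items())


def sas_bellman_policy(v, s, phi, pi, P, R, gamma):
    """(T^pi v)(s): expectation over action sets, then over the policy."""
    return sum(
        p_alpha * sum(p_a * backup(v, s, a, P, R, gamma)
                      for a, p_a in pi(s, alpha).items())
        for alpha, p_alpha in phi[s].items()
    )


def sas_bellman_optimal(v, s, phi, P, R, gamma):
    """(T^* v)(s): expectation over action sets of the best available backup."""
    return sum(
        p_alpha * max(backup(v, s, a, P, R, gamma) for a in alpha)
        for alpha, p_alpha in phi[s].items()
    )


# Toy problem: one state; action "x" pays 1 but is available only half the time.
phi = {"s0": {frozenset({"x", "y"}): 0.5, frozenset({"y"}): 0.5}}
P = {("s0", "x"): {"s0": 1.0}, ("s0", "y"): {"s0": 1.0}}
R = {("s0", "x"): 1.0, ("s0", "y"): 0.0}
v0 = {"s0": 0.0}


def uniform(s, alpha):
    return {a: 1.0 / len(alpha) for a in alpha}


t_pi = sas_bellman_policy(v0, "s0", phi, uniform, P, R, gamma=0.9)
t_star = sas_bellman_optimal(v0, "s0", phi, P, R, gamma=0.9)
```

Both operators work on the original state set; the action set enters only through the outer expectation, which is what avoids the $2^{|\mathcal{B}|}$ blow-up of the state space.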

? (?) showed that stationary optimal policies exist for SAS-MDPs and can be represented using (state-specific) decision lists (or orderings/rankings) over the action set. As a policy takes into account the available set of actions, an optimal policy chooses the highest ranked action from those that are available. Building upon these results, ? (?) proposed the following update for a tabular estimate, $q$, of $q^{\pi^{*}}$:

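$$q(S_{t},A_{t})\leftarrow(1-\eta)\,q(S_{t},A_{t})+\eta\Big(R_{t}+\gamma\max_{a\in\mathcal{A}_{t+1}}q(S_{t+1},a)\Big).$$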

Notice that the maximum is computed only over the available actions, $\mathcal{A}_{t+1}$ , in state $S_{t+1}$ . We refer to the algorithm using this update rule as SAS-Q-learning.
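A minimal sketch of this tabular update follows, assuming a dictionary-backed table keyed by (state, action). The function name `sas_q_update` and the toy states are illustrative; the only difference from standard Q-learning is that the maximum ranges over the actions available in the next state.

```python
def sas_q_update(q, s, a, r, s_next, available_next, eta, gamma):
    """One tabular SAS-Q-learning step; returns the updated q(s, a)."""
    # Bootstrap only from actions that are actually available in s_next.
    target = r + gamma * max(q[(s_next, b)] for b in available_next)
    q[(s, a)] = (1.0 - eta) * q[(s, a)] + eta * target
    return q[(s, a)]


q = {("s", "l"): 0.0, ("s", "r"): 0.0, ("t", "l"): 10.0, ("t", "r"): 2.0}

# Only "r" is available in state "t", so the high value of ("t", "l") is ignored.
restricted = sas_q_update(dict(q), "s", "r", 1.0, "t", {"r"}, eta=0.5, gamma=0.9)

# With both actions available the update coincides with ordinary Q-learning.
unrestricted = sas_q_update(dict(q), "s", "r", 1.0, "t", {"l", "r"}, eta=0.5, gamma=0.9)
```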

Potential Limitations of SAS-Q-Learning

Although SAS-Q-learning provides a powerful first model-free algorithm for approximating optimal policies for SAS-MDPs, it inherits several of the drawbacks of the Q-learning algorithm for MDPs. Just like Q-learning, in a state $S_{t}$ and with available actions $\mathcal{A}_{t}$ , the SAS-Q-learning method chooses actions deterministically when not exploring: $A_{t}\in\operatorname*{arg\,max}_{a\in\mathcal{A}_{t}}q(S_{t},a)$ . This limits its practicality for problems where optimal policies are stochastic, which is often the case when the environment is partially observable or when the use of function approximation causes state aliasing (?). Additionally, if the SAS-Q-learning update converges to an estimate, $q$ , of $q^{\pi^{*}}$ such that $\mathcal{T}v(s)=v(s)$ for all $s\in\mathcal{S}$ , then the agent will act optimally; however, convergence to a fixed-point of $\mathcal{T}$ is seldom achieved in practice, and reducing the difference between $v(s)$ and $\mathcal{T}v(s)$ (what SAS-Q-learning aims to do) does not ensure improvement of the policy (?).

SAS-Q-learning does not perform gradient ascent or descent on any function, and it can cause divergence of the estimator $q$ when using function approximation, just like Q-learning for MDPs (?). In the setting where all actions are always available, SAS-Q-learning reduces to standard Q-learning. Therefore, for all the cases in this setting where Q-learning is unstable, SAS-Q-learning is also unstable. In the setting where all actions are not always available, there exist additional cases where Q-learning is stable but SAS-Q-learning is not. However, in such cases where Q-learning is stable, its solution might not be particularly useful as it does not incorporate the notion of stochasticity in the action set (Section 8, Fig.2, ? 2018).

To see this, consider the SAS variant of the classical $\theta\rightarrow 2\theta$ MDP (?) illustrated in Figure 1. In this example there are two states, $s_{1}$ (left in Figure 1) and $s_{2}$ (right), and two actions, $a_{1}=\text{left}$ and $a_{2}=\text{right}$. The agent in this example uses function approximation (?), with weight vector $\theta\in\mathbb{R}^{2}$, such that $q(s_{1},a_{1})=\theta_{1},q(s_{2},a_{1})=2\theta_{1}$ and $q(s_{1},a_{2})=\theta_{2},q(s_{2},a_{2})=2\theta_{2}$. In either state, if the agent takes the left action, it goes to the left state, and if the agent takes the right action, it goes to the right state. In our SAS-MDP version of this problem, both actions are not always available. Let $R_{t}=0$ always, and $\gamma=1$. Consider the case where the weights of the $q$-approximation are initialized to $\theta=[-2,-5]$. Now suppose that a transition is observed from the left state to the right state, and after the transition the left action is not available to the agent. As per the SAS-Q-learning update rule provided in (4), $\theta_{2}\leftarrow\theta_{2}+\eta(r+\gamma 2\theta_{2}-\theta_{2}).$ Since $r=0$ and $\gamma=1$, this is equivalent to $\theta_{2}\leftarrow\theta_{2}+\eta\theta_{2}.$ In the off-policy setting where this transition is used repeatedly on its own, irrespective of the learning rate $\eta>0$, the weight $\theta_{2}$ diverges to $-\infty$. In contrast, had there been no constraint of using max over $q$ given the available actions, the Q-learning update would have been $\theta_{2}\leftarrow\theta_{2}+\eta(r+\gamma 2\theta_{1}-\theta_{2})$, because action $a_{1}$ has a higher q-value than $a_{2}$ due to $\theta_{1}>\theta_{2}$. This would make $\theta_{2}$ converge to the value $-4$ (the correct answer is [math]).

This provides an example of how the stochastic constraints on the set of available actions can be instrumental in causing the SAS-Q-learning method to diverge, and ignoring the stochastic constraint can prevent Q-learning from converging to the correct solution. We suspect more such cases can be constructed by adapting examples from non-SAS setup ( ? 1995, ? 1996, Chpt 11.2 ? 2018).
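The two update rules from this example can be iterated numerically. The sketch below replays the single left-to-right transition 200 times with $\eta=0.1$, $r=0$, $\gamma=1$ and the initial weights $\theta=[-2,-5]$ from the text; the function names, the step count, and the learning rate are illustrative choices.

```python
def sas_q_step(theta2, eta, r=0.0, gamma=1.0):
    """SAS-Q-learning: only 'right' is available in s2, so the target is 2 * theta2."""
    return theta2 + eta * (r + gamma * 2.0 * theta2 - theta2)


def unconstrained_q_step(theta1, theta2, eta, r=0.0, gamma=1.0):
    """Q-learning that ignores availability: the max ranges over both actions in s2."""
    return theta2 + eta * (r + gamma * 2.0 * max(theta1, theta2) - theta2)


theta1 = -2.0
theta_sas = theta_plain = -5.0
for _ in range(200):  # replay the same s1 -> s2 transition repeatedly
    theta_sas = sas_q_step(theta_sas, eta=0.1)
    theta_plain = unconstrained_q_step(theta1, theta_plain, eta=0.1)

print(theta_sas)    # a large negative value whose magnitude keeps growing
print(theta_plain)  # close to -4
```

Each constrained step multiplies $\theta_{2}$ by $(1+\eta)$, so the magnitude grows geometrically for any positive learning rate, whereas the unconstrained update contracts toward $2\theta_{1}=-4$.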

Policy Gradient Methods for SAS-MDPs

In this section we provide an alternative to the SAS-Q-learning algorithm by deriving policy gradient algorithms (?) for the SAS-MDP setting. While the Q-learning algorithm minimizes the error between $\mathcal{T}v(s)$ and $v(s)$ for all states $s$ (using a procedure that is not a gradient algorithm), policy gradient algorithms perform stochastic gradient ascent on the objective function $J$ . That is, they use the update $\theta\leftarrow\theta+\eta\Delta$ , where $\Delta$ is an unbiased estimator of $\nabla J(\theta)$ .

Unlike the Q-learning algorithm, policy gradient algorithms for MDPs provide convergence guarantees to a critical point (local/global optima) even when using function approximation, and can approximate optimal stochastic policies. However, ignoring the fact that actions are not always available and using off-the-shelf algorithms for MDPs fails to fully capture the problem setting (?). It is therefore important that we derive policy gradient algorithms that are appropriate for the SAS-MDP setting, as they provide the first convergent model-free algorithms for SAS-MDPs when using function approximation. In the following lemma we extend the expression for the policy gradient for MDPs (?; ?) to handle stochastic action sets.

Lemma 1 (SAS Policy Gradient).

For a SAS-MDP, for all $s\in\mathcal{S}$ ,

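$$\nabla J(\theta)=\sum_{t=0}^{\infty}\sum_{s\in\mathcal{S}}\gamma^{t}\Pr(S_{t}=s\mid\theta)\Big(\sum_{\alpha\in 2^{\mathcal{B}}}\varphi(s,\alpha)\sum_{a\in\alpha}q^{\theta}(s,a)\frac{\partial\pi^{\theta}(s,\alpha,a)}{\partial\theta}\Big).$$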

Proof.

See Appendix A.∎

It follows from Lemma 1 that we can create unbiased estimates of $\nabla J(\theta)$, which can be used to update $\theta$ using the well-known stochastic gradient ascent algorithm. This algorithm is presented in Algorithm 1. Notably, this process does not require the agent to know $\varphi$. Also, similar to the SAS-Q-learning method, the policy can be parameterized such that it is not required to embed the available actions as a part of the state. One such parameterization is provided in Appendix F. Notice that in the special case where all actions are always available, the expression in Lemma 1 reduces to the policy gradient theorem for MDPs (?). We now establish that SAS policy gradient algorithms are guaranteed to converge to locally optimal policies under the following standard assumptions:

Assumption A1 (Differentiable).

For any state, action-set, and action triplet $(s,\alpha,a)$ , policy $\pi^{\theta}(s,\alpha,a)$ is continuously differentiable in the parameter $\theta$ .

Assumption A2 (Lipschitz smooth gradient).

Let $\Theta$ denote the set of all possible parameters for policy $\pi^{\theta}$ , then for some constant $L$ ,

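$$\lVert\nabla J(\theta)-\nabla J(\bar{\theta})\rVert\leq L\lVert\theta-\bar{\theta}\rVert\quad\forall\,\theta,\bar{\theta}\in\Theta.$$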

Assumption A3 (Learning rate schedule).

Let $\eta_{\theta}^{t}$ be the learning rate for updating policy parameters $\theta$ , then,

$$\sum_{t=0}^{\infty}\eta_{\theta}^{t}=\infty,\qquad\sum_{t=0}^{\infty}(\eta_{\theta}^{t})^{2}<\infty.$$

All the assumptions (A1-A3) are satisfied under standard policy parameterization techniques (linear-function/neural-networks with softmax) and appropriately set learning rates.
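As a concrete instance of an appropriately set learning rate, a harmonic schedule satisfies both conditions of Assumption A3. The particular schedule below is an illustrative choice, not one prescribed by the paper.

```latex
\eta_{\theta}^{t} = \frac{1}{t+1}:
\qquad
\sum_{t=0}^{\infty}\frac{1}{t+1} = \infty,
\qquad
\sum_{t=0}^{\infty}\frac{1}{(t+1)^{2}} = \frac{\pi^{2}}{6} < \infty .
```

By contrast, a constant step size violates the second condition, and $\eta_{\theta}^{t}=1/(t+1)^{2}$ violates the first.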

Lemma 2.

Under Assumptions (A1)-(A3), the SAS policy gradient algorithm causes $\nabla J(\theta_{t})\to 0$ as $t\to\infty$ , with probability one.

Proof.

See Appendix B.∎
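Since $\partial\pi^{\theta}/\partial\theta=\pi^{\theta}\,\partial\log\pi^{\theta}/\partial\theta$, the gradient in Lemma 1 can be estimated from sampled $(S_{t},\mathcal{A}_{t},A_{t})$ triples without knowing $\varphi$. The following is a minimal sketch of one such stochastic gradient ascent step, assuming a softmax policy normalized over the available actions and a given estimate of $q^{\theta}(S_{t},A_{t})$; it is not the paper's Algorithm 1, and all function names are hypothetical.

```python
import math


def masked_softmax(theta, features, available):
    """pi(s, alpha, .) as a softmax over linear scores of available actions."""
    scores = {a: sum(w * x for w, x in zip(theta, features[a])) for a in available}
    top = max(scores.values())
    weights = {a: math.exp(z - top) for a, z in scores.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def score(theta, features, available, action):
    """d log pi(s, alpha, a) / d theta for the masked softmax above."""
    probs = masked_softmax(theta, features, available)
    dim = len(theta)
    mean = [sum(probs[b] * features[b][i] for b in available) for i in range(dim)]
    return [features[action][i] - mean[i] for i in range(dim)]


def sas_pg_step(theta, features, available, action, q_estimate, eta, discount=1.0):
    """theta <- theta + eta * gamma^t * score * q_estimate (single-sample estimate)."""
    grad_log = score(theta, features, available, action)
    return [w + eta * discount * q_estimate * g for w, g in zip(theta, grad_log)]


# Toy usage: one-hot features for three base actions; action 1 is unavailable.
features = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0], 2: [0.0, 0.0, 1.0]}
theta = [0.1, -0.3, 0.5]
available = [0, 2]
new_theta = sas_pg_step(theta, features, available, action=0, q_estimate=1.0, eta=0.5)
```

In this toy setup the weight tied to the unavailable action receives zero gradient, and a positive return estimate raises the probability of the action that was taken.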

Natural policy gradient algorithms (?) extend policy gradient algorithms to follow the natural gradient of $J$ (?). In essence, whereas policy gradient methods perform gradient ascent in the space of policy parameters by computing the gradient of $J$ as a function of the parameters $\theta$ , natural policy gradient methods perform gradient ascent in the space of policies (which are probability distributions) by computing the gradient of $J$ as a function of the policy, $\pi$ . Thus, whereas policy gradient implicitly measures distances between policies by the Euclidean distance between their policy parameters, natural policy gradient methods measure distances between policies using notions of distance between probability distributions. In the most common form of natural policy gradients, the distances between policies are measured using a Taylor approximation of Kullback-Leibler divergence (KLD). By performing gradient ascent in the space of policies rather than the space of policy parameters, the natural policy gradient becomes invariant to how the policy is parameterized (?), which can help to mitigate the vanishing gradient problem in neural networks and improve learning speed (?).

The natural policy gradient (using a Taylor approximation of KLD to measure distances) is $\widetilde{\nabla}J(\theta)\coloneqq F_{\theta}^{-1}\nabla J(\theta)$ where $F_{\theta}$ is the Fisher information matrix (FIM) associated with the policy $\pi^{\theta}$. Although the FIM is a well-known quantity, it is typically associated with a parameterized probability distribution. Here, $\pi^{\theta}$ is a collection of probability distributions, one per state. This raises the question of what $F_{\theta}$ should be when computing the natural policy gradient. Following the work of ? (?) for MDPs, we show that the FIM, $F_{\theta}$, for computing the natural policy gradient for a SAS-MDP can also be derived by viewing $\pi^{\theta}$ as a distribution over possible trajectories (sequences of states, available action sets and executed actions).

Property 1 (Fisher Information Matrix).

For a policy, parameterized using weights $\theta$ , let $\psi^{\theta}(s,\alpha,a)\coloneqq$ $\partial\log\pi^{\theta}(s,\alpha,a)/\partial\theta$ , then the Fisher information matrix is,

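$$F_{\theta}=\sum_{t=0}^{\infty}\sum_{s\in\mathcal{S}}\gamma^{t}\Pr(S_{t}=s\mid\theta)\Big(\sum_{\alpha\in 2^{\mathcal{B}}}\varphi(s,\alpha)\sum_{a\in\alpha}\pi^{\theta}(s,\alpha,a)\psi^{\theta}(s,\alpha,a)\psi^{\theta}(s,\alpha,a)^{\top}\Big).$$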

Proof.

See Appendix C.∎

Furthermore, ? (?) showed that many terms in the definition of the natural policy gradient cancel, providing a simple expression for the natural gradient which can be estimated with time linear in the number of policy parameters per time step. We extend the result of ? (?) to the SAS-MDP formulation in the following lemma:

Lemma 3 (SAS Natural Policy Gradient).

Let $w$ be a parameter such that,

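$$\frac{\partial}{\partial w}\mathbb{E}\Big[\frac{1}{2}\sum_{t}^{\infty}\gamma^{t}\big(\psi^{\theta}(S_{t},\mathcal{A}_{t},A_{t})^{\top}w-q^{\theta}(S_{t},A_{t})\big)^{2}\Big]=0,$$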

then for all $s\in\mathcal{S}$ in $\mathcal{M}^{\prime}$, $\widetilde{\nabla}J(\theta)=w.$

Proof.

See Appendix C.∎

From Lemma 3, we can derive a computationally efficient natural policy gradient algorithm by using the well-known temporal difference algorithm (?), modified to work with SAS-MDPs, to estimate $q^{\theta}$ with the approximator $\psi^{\theta}(S_{t},\mathcal{A}_{t},A_{t})^{\top}w$ , and then using the update $\theta\leftarrow\theta+\eta w$ . This algorithm, which is the SAS-MDP equivalent of NAC-TD (?; ?; ?; ?), is provided in Algorithm 2 in Appendix E.

Adaptive Variance Mitigation

In the previous section, we derived (natural) policy gradient algorithms for SAS-MDPs. While these algorithms avoid the divergence of SAS-Q-learning, they suffer from the high variance of policy gradient estimates (?). As a consequence of the additional stochasticity that results from stochastic action sets, this problem can be even more severe in the SAS-MDP setting. In this section, we leverage insights from the Bellman equation for SAS-MDPs, provided in (2), to reduce the variance of policy gradient estimates.

One of the most popular methods to reduce variance is the use of a state-dependent baseline $b(s)$ . ? (?) showed that, for any state-dependent baseline $b(s)$ :

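$$\nabla J(\theta)=\mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^{t}\psi^{\theta}(s,\alpha,a)\big(q^{\theta}(s,a)-b(s)\big)\Big].$$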

For any random variables $X$ and $Y$, we know that the variance of $X-Y$ is given by $\text{var}(X-Y)=\text{var}(X)+\text{var}(Y)-2\text{cov}(X,Y)$, where cov stands for covariance. Therefore, the variance of $X-Y$ is less than the variance of $X$ if $2\text{cov}(X,Y)>\text{var}(Y)$. As a result, any state-dependent baseline $b(s)$ whose value is sufficiently correlated with the expected return, $q^{\theta}(s,a)$, can be used to reduce the variance of the sample estimator of (10). A baseline dependent on both the state and action can have higher correlation with $q^{\theta}(s,a)$, and could therefore reduce variance further. However, such action-dependent baselines cannot be used directly, as they can result in biased gradient estimates. Developing such baselines remains an active area of research for MDPs (?; ?; ?; ?; ?) and is largely complementary to our purpose. Further, even the optimal state-dependent baseline (?), which leads to the minimum variance gradient estimator, is not feasible to compute and only under certain restrictive assumptions reduces to the common choice of state-value function estimator, $\hat{v}(s)$. Therefore, in the following, we propose multiple baselines that are easy to compute, and then combine them optimally.
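The variance identity and the condition $2\text{cov}(X,Y)>\text{var}(Y)$ can be checked on a small fixed data set. The numbers below are made up for illustration: `returns` stands in for sampled values of $X$, and the two baselines play the role of $Y$.

```python
from statistics import fmean


def variance(xs):
    """Population variance."""
    m = fmean(xs)
    return fmean([(x - m) ** 2 for x in xs])


def covariance(xs, ys):
    """Population covariance."""
    mx, my = fmean(xs), fmean(ys)
    return fmean([(x - mx) * (y - my) for x, y in zip(xs, ys)])


def residual_variance(xs, ys):
    """var(X - Y), computed directly from the differences."""
    return variance([x - y for x, y in zip(xs, ys)])


returns = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]           # stand-in for X
good_baseline = [0.5, 1.5, 2.0, 3.5, 4.0, 5.5]     # strongly correlated with X
bad_baseline = [3.0, -2.0, 4.0, -1.0, 2.0, -3.0]   # negatively correlated with X

for baseline in (good_baseline, bad_baseline):
    helps = 2 * covariance(returns, baseline) > variance(baseline)
    print(helps, residual_variance(returns, baseline) < variance(returns))
```

For each baseline the two printed flags agree: subtracting it lowers the variance exactly when twice the covariance exceeds the baseline's own variance.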

We now introduce a baseline for SAS-MDPs that lies between state-dependent and state-action-dependent baselines. Like state-dependent baselines, these new baselines do not introduce bias into gradient estimates. However, like action-dependent baselines these new baselines include some information about the chosen actions. Specifically, we propose baselines that depend on the state, $S_{t}$ , and available action set $\mathcal{A}_{t}$ , but not the precise action, $A_{t}$ .

Recall from the SAS Bellman equation (2) that the state-value function for SAS-MDPs can be written as $v^{\theta}(s)=\sum_{\alpha\in 2^{\mathcal{B}}}\varphi(s,\alpha)\sum_{a\in\alpha}\pi^{\theta}(s,\alpha,a)q^{\theta}(s,a)$. While we cannot directly use a baseline dependent on the action sampled from $\pi^{\theta}$, we can use a baseline dependent on the sampled action set. We consider a new baseline which leverages this information about the sampled action set $\alpha$. This baseline is $\bar{q}(s,\alpha)\coloneqq\sum_{a\in\alpha}\pi^{\theta}(s,\alpha,a)\hat{q}(s,a),$ where $\hat{q}$ is a learned estimator of the state-action value function, and $\bar{q}$ represents its expected value under the current policy, $\pi^{\theta}$, conditioned on the sampled action set $\alpha$.
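Computing $\bar{q}(s,\alpha)$ only requires the policy's probabilities over the sampled action set and the learned $\hat{q}$ values. A minimal sketch follows, assuming both are stored in dictionaries; the function name and the toy numbers are illustrative.

```python
def action_set_baseline(probs, q_hat, state, available):
    """q_bar(s, alpha) = sum over a in alpha of pi(s, alpha, a) * q_hat(s, a)."""
    return sum(probs[a] * q_hat[(state, a)] for a in available)


q_hat = {("s", "left"): 4.0, ("s", "right"): 8.0, ("s", "stay"): 0.0}
probs = {"left": 0.25, "right": 0.75}  # pi(s, alpha, .) for alpha = {left, right}
baseline = action_set_baseline(probs, q_hat, "s", {"left", "right"})
```

The value depends on the sampled action set but not on which action was then chosen, which is the property that keeps the resulting gradient estimate unbiased.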

In principle, we expect $\bar{q}(S_{t},\mathcal{A}_{t})$ to be more correlated with $q^{\theta}(S_{t},A_{t})$ as it explicitly conditions on the action set and does not compute an average over all action sets possible, like $\hat{v}$ . Practically, however, estimating $q$ values can be harder than estimating $v$ . This can be attributed to the fact that with the same number of training samples, the number of parameters to learn in $\hat{q}$ is more than those in an estimate of $v^{\theta}$ . This poses a new dilemma of deciding when to use which baseline. To get the best of both, we consider using a weighted combination of $\hat{v}(S_{t})$ and $\bar{q}(S_{t},\mathcal{A}_{t})$ . In the following property we establish that using any weighted combination of these two baselines results in an unbiased estimate of the SAS policy gradient.

Property 2 (Unbiased estimator).

Let $\hat{J}(s,\alpha,a,\theta)\coloneqq\psi^{\theta}(s,\alpha,a)\left(q^{\theta}(s,a)+\lambda_{1}\hat{v}(s)+\lambda_{2}\bar{q}(s,\alpha)\right)$ and $d^{\pi}(s)\coloneqq(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\Pr(S_{t}=s)$, then for any values of $\lambda_{1}\in\mathbb{R}$ and $\lambda_{2}\in\mathbb{R}$,

[TABLE]

Proof.

See Appendix D. ∎
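The reason a baseline that depends only on $(s,\alpha)$ leaves the gradient unbiased can be checked numerically. The sketch below assumes a softmax policy over the available actions and differentiates with respect to the action scores rather than $\theta$. Both choices are simplifications made for illustration, and the function name is hypothetical.

```python
import numpy as np

def score_wrt_logits(probs, a):
    """d log pi(a) / d y for a (masked) softmax policy: one_hot(a) - probs."""
    grad = -probs.copy()
    grad[a] += 1.0
    return grad

# Masked policy over four base actions; the second action is unavailable.
probs = np.array([0.5, 0.0, 0.25, 0.25])
# Any value that depends on (s, alpha) but not on the chosen action,
# e.g. lambda_1 * v_hat(s) + lambda_2 * q_bar(s, alpha).
baseline_value = 1.7

# Exact expectation over actions of (score function) * (baseline).
expected_baseline_term = sum(
    probs[a] * score_wrt_logits(probs, a) * baseline_value
    for a in range(len(probs))
)
```

The expectation is the zero vector for any `baseline_value`, because the probabilities sum to one and their derivatives therefore sum to zero.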

The question remains: what values should be used for $\lambda_{1}$ and $\lambda_{2}$ when combining $\hat{v}$ and $\bar{q}$? Similar problems of combining different estimators have been studied in the statistics literature (?; ?) and more recently for combining control variates (?; ?). Building upon their ideas, rather than leaving $\lambda_{1}$ and $\lambda_{2}$ as open hyperparameters, we propose a method for automatically adapting $\mathbf{A}=[\lambda_{1},\lambda_{2}]$ for the specific SAS-MDP and current policy parameters, $\theta$. The following lemma presents an analytic expression for the value of $\mathbf{A}$ that minimizes a sample-based estimate of the variance of $\hat{J}$.

Lemma 4 (Adaptive variance mitigation).

If $\mathbf{A}=[\lambda_{1},\lambda_{2}]^{\top},$ $\mathbf{B}=[\psi^{\theta}(s,\alpha,a)\hat{v}(s),\psi^{\theta}(s,\alpha,a)\bar{q}(s,\alpha)]^{\top},$ and $\mathbf{C}=[\psi^{\theta}(s,\alpha,a)q^{\theta}(s,a)]^{\top}$ , where $\mathbf{A}\in\mathbb{R}^{2\times 1},\mathbf{B}\in\mathbb{R}^{d\times 2}$ , and $\mathbf{C}\in\mathbb{R}^{d\times 1}$ , then the $\mathbf{A}$ that minimizes the variance of $\hat{J}$ is given by

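$$\mathbf{A}=-\left(\mathbb{E}\left[\mathbf{B}^{\top}\mathbf{B}\right]\right)^{-1}\mathbb{E}\left[\mathbf{B}^{\top}\mathbf{C}\right].$$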

Proof.

See Appendix D.∎

Lemma 4 provides the values for $\lambda_{1}$ and $\lambda_{2}$ that result in the minimal variance of $\hat{J}$. Note that the computational cost of inverting $\mathbb{E}\left[\mathbf{B}^{\top}\mathbf{B}\right]$ is negligible because it is always a $2\times 2$ matrix, independent of the number of policy parameters. These optimal values are expressed in terms of expectations, so in practice they must still be approximated using sample-based estimates of $\mathbf{B}$ and $\mathbf{C}$. One might use double sampling for $\mathbf{B}$ to get unbiased estimates of the variance-minimizing value of $\mathbf{A}$ (?). However, as Property 2 ensures that estimates of $\hat{J}$ are unbiased for any values of $\lambda_{1}$ and $\lambda_{2}$, we opt to use all the available samples for estimating $\mathbb{E}[\mathbf{B}^{\top}\mathbf{B}]$ and $\mathbb{E}[\mathbf{B}^{\top}\mathbf{C}]$.
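A minimal sketch of the sample-based computation, assuming $\hat{J}=\mathbf{C}+\mathbf{B}\mathbf{A}$ so that setting the gradient of $\mathbb{E}[\lVert\mathbf{C}+\mathbf{B}\mathbf{A}\rVert^{2}]$ to zero gives $\mathbf{A}=-(\mathbb{E}[\mathbf{B}^{\top}\mathbf{B}])^{-1}\mathbb{E}[\mathbf{B}^{\top}\mathbf{C}]$. The function name, the array shapes and the small `ridge` term (added only for numerical safety) are assumptions of this example.

```python
import numpy as np

def variance_minimizing_weights(B_samples, C_samples, ridge=1e-6):
    """Estimate A = -(E[B^T B])^{-1} E[B^T C] from a batch of samples.

    B_samples : array of shape (n, d, 2), one d x 2 matrix per transition
    C_samples : array of shape (n, d), one d-vector per transition
    ridge     : hypothetical diagonal regularizer keeping the 2 x 2 system solvable
    """
    B = np.asarray(B_samples, dtype=float)
    C = np.asarray(C_samples, dtype=float)
    n = B.shape[0]
    BtB = np.einsum("ndi,ndj->ij", B, B) / n   # sample estimate of E[B^T B], shape (2, 2)
    BtC = np.einsum("ndi,nd->i", B, C) / n     # sample estimate of E[B^T C], shape (2,)
    return -np.linalg.solve(BtB + ridge * np.eye(2), BtC)

# Synthetic check: if q = 0.3 * v_hat + 0.7 * q_bar exactly, the variance-minimizing
# weights cancel it, i.e. A is approximately [-0.3, -0.7].
rng = np.random.default_rng(0)
n, d = 500, 3
psi = rng.normal(size=(n, d))
v_hat = rng.normal(size=n)
q_bar = rng.normal(size=n)
B_samples = np.stack([psi * v_hat[:, None], psi * q_bar[:, None]], axis=-1)
C_samples = psi * (0.3 * v_hat + 0.7 * q_bar)[:, None]
A_hat = variance_minimizing_weights(B_samples, C_samples)
```

Because the linear system is only $2\times 2$, the cost of the solve does not grow with the number of policy parameters $d$. Only the two inner products do.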

Algorithm

Pseudo-code for the SAS policy gradient algorithm is provided in Algorithm 1. Let the estimators of $v^{\theta}$ and $q^{\theta}$ be $\hat{v}^{\varpi}$ and $\hat{q}^{\omega}$, which are parameterized using $\varpi$ and $\omega$, respectively. Let $\pi^{\theta}$ correspond to the policy parameterized using $\theta$. Let $\eta_{\varpi},\eta_{\omega},\eta_{\theta}$ and $\eta_{\lambda}$ be the learning-rate hyper-parameters. We begin by initializing the $\lambda$ values to $-0.5$ each, so that the algorithm initially subtracts the average of the two baselines from the sampled return. In Lines $3$ and $4$, we execute $\pi^{\theta}$ to observe the trajectory and compute the return. Lines $6$ and $7$ correspond to the updates for parameters associated with $\hat{v}^{\varpi}$ and $\hat{q}^{\omega}$, using their corresponding TD errors (?). The policy parameters are then updated using a combination of both the baselines. We drop the $\gamma^{t}$ dependency for data efficiency (?). As per Lemma 4, for automatically tuning the values of $\lambda_{1}$ and $\lambda_{2}$, we create the sample estimates of the matrices $\mathbf{B}$ and $\mathbf{C}$ using the transitions from batch $\mathbb{B}$, in Lines $9$ and $10$. To update the $\lambda$ values, we compute $\mathbf{\hat{A}}$ using the sample estimates of $\mathbb{E}[\mathbf{B}^{\top}\mathbf{B}]$ and $\mathbb{E}[\mathbf{B}^{\top}\mathbf{C}]$. While computing the inverse, a small diagonal term is added to ensure that the inverse exists. As everything is parameterized using smooth functions, subsequent estimates of $\mathbf{A}$ should not vary much. Since we only have access to the sample estimate of $\mathbf{A}$, we use Polyak-Ruppert averaging in Line $12$ for stability. Due to space constraints, the algorithm for SAS natural policy gradient is deferred to Appendix E.
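The averaging step for the $\lambda$ values can be sketched as follows. The exact update in Algorithm 1 is not reproduced here. This example uses an exponential moving average with rate $\eta_{\lambda}$ as a stand-in, which is an assumption, as are the function name and the noisy estimates.

```python
import numpy as np

def averaged_lambda_update(lam, A_hat, eta_lambda=0.999):
    """Move the running lambda vector a small step toward the new sample estimate A_hat."""
    lam = np.asarray(lam, dtype=float)
    A_hat = np.asarray(A_hat, dtype=float)
    return eta_lambda * lam + (1.0 - eta_lambda) * A_hat

lam = np.array([-0.5, -0.5])  # initial values stated in the text
# Hypothetical noisy per-batch estimates of the variance-minimizing weights.
noisy_estimates = [
    np.array([-1.2, 0.1]),
    np.array([-0.8, -0.1]),
    np.array([-1.0, 0.0]),
]
for A_hat in noisy_estimates:
    lam = averaged_lambda_update(lam, A_hat)
```

With a rate close to one, a single noisy batch estimate moves the weights only slightly, which is the stabilizing effect the averaging is meant to provide.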

Empirical Analysis

In this section we use empirical studies to answer the following three questions: (a) How do our proposed algorithms, SAS policy gradient (SAS-PG) and SAS natural policy gradient (SAS-NPG), compare to the prior method SAS-Q-learning? (b) How does our adaptive variance reduction technique weight the two baselines over the training duration? (c) What impact does the probability of action availability have on the performances of SAS-PG, SAS-NPG, and SAS-Q-learning? To evaluate these aspects, we first briefly introduce three domains inspired by real-world problems.

Routing in San Francisco.

This task models the problem of finding shortest paths in San Francisco, and was first presented with stochastic actions by ? (?). Stochastic actions model the concept that certain paths in the road network may not be available at certain times. A positive reward is provided to the agent when it reaches the destination, while a small penalty is applied at every time step. We modify the domain presented by ? (?) so that the starting state of the agent is not one particular node, but is instead chosen uniformly at random among all possible locations. This makes the problem more challenging, since it requires the agent to learn the shortest path from every node. All the states (nodes) are discrete, and edges correspond to the action choices. Each edge is made available with some fixed probability. The overall map is shown in the Appendix.
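To make the notion of a stochastic action set concrete, here is a minimal sketch of how the available edges at a node could be sampled, assuming each edge is included independently with the same fixed probability. Resampling when the set comes out empty is an assumption of this example, since the text does not say how that case is handled, and all names are hypothetical.

```python
import random

def sample_available_actions(actions, p_available, rng):
    """Include each action independently with probability p_available.

    Resamples until at least one action is available (an assumption of this sketch).
    """
    if not actions:
        raise ValueError("the base action set must be non-empty")
    if not 0.0 < p_available <= 1.0:
        raise ValueError("p_available must lie in (0, 1]")
    while True:
        subset = [a for a in actions if rng.random() < p_available]
        if subset:
            return subset

rng = random.Random(0)
outgoing_edges = ["north", "east", "south", "west"]  # hypothetical edges at one node
available_edges = sample_available_actions(outgoing_edges, 0.8, rng)
```

The same sampler would apply to the other domains by swapping the list of edges for actuators or products.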

Robot locomotion task in a maze.

In this domain, the agent has to navigate a maze using unreliable actuators. The agent starts at the bottom left corner and a goal reward is given when it reaches the goal position, marked by a star (see Appendix for the figure). The agent is penalized at each time step to encourage it to reach the goal as quickly as possible. The state space is continuous, and corresponds to real-valued Cartesian coordinates of the agent’s position. The agent has $16$ actuators pointing in different directions. Turning each actuator on moves the agent in the direction of the actuator. However, each actuator is unreliable, and is therefore only available with some fixed probability.

Product recommender system.

In online marketing and sales, product recommendation is a popular problem. Due to various factors such as stock outages, promotions, and delivery issues, not all products can always be recommended. To model this, we consider a synthetic setup of providing recommendations to a user from a batch of $100$ products, each available with some fixed probability and associated with a stochastic reward corresponding to profit. Each user has a real-valued context, which forms the state space, and the recommender system interacts with a randomly chosen user for $5$ steps. The goal for the recommender system is to suggest products that maximize total profit. Often the problem of recommendation is formulated as a contextual bandit or collaborative filtering problem, but as shown by ? (?) these approaches fail to capture the long-term value of the prediction. Hence we resort to the full RL setup.

Results

Here we only discuss the representative results for the three major questions of interest. Plots for detailed evaluations are available in Appendix F.

(a) For the routing problem in San Francisco, as both the states and actions are discrete, the q-function for each state-action pair has a unique parameter. When no parameters are shared, SAS-Q-learning will not diverge. Therefore, in this domain, we notice that SAS-Q-learning performs similarly to the proposed algorithms. However, in many large-scale problems, the use of function approximators is crucial for estimating the optimal policy. For the robot locomotion task in the maze domain and the recommender system, the state space is not discrete and hence function approximators are required to obtain the state features. As we saw in Section Potential Limitations of SAS-Q-Learning, the sharing of state features can create problems for SAS-Q-learning. The increased variance in the performance of SAS-Q-learning is visible in both the Maze and the Recommender system domains in Figure 2. While SAS-Q-learning eventually reaches the same performance on the Maze domain, its improvement saturates quickly in the recommender system domain, resulting in a sub-optimal policy.

(b) To provide visual intuition for the behavior of adaptive variance mitigation, we report the values of $\lambda_{1}$ and $\lambda_{2}$ over the training duration in Figure 2. As several factors are combined through (12) to influence the $\lambda$ values, it is hard to pinpoint any individual factor that is responsible for the observed trend. However, note that for both the routing problem in San Francisco and the robot navigation in the maze, the goal reward is obtained on reaching the destination and intermediate actions do not impact the total return significantly. Intuitively, this makes the action-set-conditioned baseline $\bar{q}$ about as correlated with the observed return as the state-only baseline $\hat{v}$, but at the expense of estimating significantly more parameters. Thus the weight for $\bar{q}$ is automatically adapted to be closer to zero. On the other hand, in the recommender system, each product has a significant amount of associated reward. Therefore, the total return possible during each episode depends strongly on the available action set, and thus the magnitude of the weight for $\bar{q}$ is much larger than that for $\hat{v}$.

(c) To understand the impact of the probability of an action being available, we report the best performances for all the algorithms for different probability values in Figure 3. We notice that in the San Francisco routing domain, SAS-Q-learning has a slight edge over the proposed methods. This can be attributed to the fact that off-policy samples can be re-used without causing any divergence problems, as state features are not shared. For the maze and the recommender system tasks, where function approximators are necessary, the proposed methods significantly outperform SAS-Q-learning.

Conclusion

Building upon the SAS-MDP framework of ? (?), we studied an under-addressed problem of dealing with MDPs with stochastic action sets. We highlighted some of the limitations of the existing method and addressed them by generalizing policy gradient methods for SAS-MDPs. Additionally, we introduced a novel baseline and an adaptive variance reduction technique unique to this setting. Our approach has several benefits. Not only does it generalize the theoretical properties of standard policy gradient methods, but it is also practically efficient and simple to implement.

Acknowledgement

The research was supported by and partially conducted at Adobe Research. We are also immensely grateful to the three anonymous reviewers who shared their insights and feedback, especially to the second reviewer who helped improve the counterexample.

Reinforcement Learning When All Actions are Not Always Available (Supplementary Material)

Appendix A: SAS Policy Gradient

Lemma 1 (SAS Policy Gradient).

For all $s\in\mathcal{S}$ ,

[TABLE]

Proof.

[TABLE]

where (19) comes from unrolling the Bellman equation. We started with the partial derivative of the value of a state, expanded the definition of the value of a state, and obtained an expression in terms of the partial derivative of the value of another state. Now, we again expand $\partial v^{\theta}(s^{\prime})/\partial\theta$ using the definition of the state-value function and the Bellman equation.

[TABLE]

Expanding $\partial v^{\theta}(s^{\prime})/\partial\theta$ allowed us to write it in terms of the partial derivative of yet another state, $s^{\prime\prime}$ . We could continue this process, “unravelling” the recurrence further. Each time that we expand the partial derivative of the value of a state with respect to the parameters, we get another term. The first two terms that we have obtained are marked above. If we were to unravel the expression more times, by expanding $\partial v^{\theta}(s^{\prime\prime})/\partial\theta$ and then differentiating, we would obtain the subsequent third, fourth, etc., terms.

Finally, to get the desired result, we expand the start-state objective and take its derivative with respect to $\theta$,

[TABLE]

Combining results from (33) and (34), we index each term by $t$ , with the first term being $t=0$ , the second $t=1$ , etc., which results in the expression:

[TABLE]

Notice that to get the gradient of $J(\theta)$, we have included a sum over all the states weighted by the start-state probability, $d_{0}(s)$. When $t=0$, the only state where $\Pr(S_{0}=s|S_{0}=s,\theta)$ is not zero will be when $s=s$ (at which point this probability is one). This allows us to succinctly represent all the terms. With this we conclude the proof. ∎

Appendix B: Convergence

Lemma 2.

Under Assumptions (A1)-(A3), the SAS policy gradient algorithm causes $\nabla J(\theta_{t})\to 0$ as $t\to\infty$, with probability one.

Proof.

Following the standard result on convergence of gradient ascent (descent) methods (?), we know that under Assumptions (A1)-(A3), either $J(\theta)\to\infty$ or $\nabla J(\theta)\to 0$ as $t\to\infty$. However, the maximum possible reward is $R_{\text{max}}$ and $\gamma<1$, therefore $J(\theta)$ is bounded above by $R_{\text{max}}/(1-\gamma)$. Hence $J(\theta)$ cannot go to $\infty$ and we get the desired result. ∎
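The bound used in the proof can be written out as a geometric series. The sketch below assumes that $J(\theta)$ is the expected discounted return and that every reward satisfies $R_{t}\le R_{\text{max}}$.

```latex
J(\theta)
= \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}R_{t}\,\middle|\,\theta\right]
\le \sum_{t=0}^{\infty}\gamma^{t}R_{\text{max}}
= \frac{R_{\text{max}}}{1-\gamma},
\qquad 0\le\gamma<1 .
```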

Appendix C: SAS Natural Policy Gradient

Property 1 (Fisher Information Matrix).

For a policy, parameterized using weights $\theta$ , let $\psi^{\theta}(s,\alpha,a)\coloneqq\partial\log\pi^{\theta}(s,\alpha,a)/\partial\theta$ , then the Fisher information matrix is,

[TABLE]

Proof.

To prove this result, we first note the following relation by ? (?) which connects the Hessian and the FIM of a random variable $X$ parameterized using $\theta$ ,

[TABLE]

Now, let $\mathscr{T}_{\theta}$ denote the random variable corresponding to the trajectories observed using policy $\pi^{\theta}$ . Let $\tau=(s_{0},\alpha_{0},a_{0},s_{1},\alpha_{1},a_{1},...)$ denote an outcome of $\mathscr{T}_{\theta}$ , then the probability of observing this trajectory, $\tau$ , is given by,

[TABLE]

Therefore,

[TABLE]

We know that the Fisher information matrix for a random variable, which in our case is $\mathscr{T}_{\theta}$, is given by,

[TABLE]

where the summation over $\mathscr{T}_{\theta}$ corresponds to all possible values of $s,\alpha$ and $a$ for every step $t$ in the trajectory. Expanding the inner summation in (46),

[TABLE]

Note that the summation in (47) over all possible trajectories, i.e. all possible values of $s,\alpha$ and $a$ for every step $t$ , marginalizes out the terms not associated with respective $\log\pi^{\theta}$ terms, i.e.,

[TABLE]

Combining all the terms in (50) and discounting them appropriately with $\gamma$ , we get,

[TABLE]

Finally, note that using (37),

[TABLE]

Combining (51) and (52) we get,

[TABLE]

With this we conclude the proof. ∎

Lemma 3 (SAS Natural Policy Gradient).

Let $w$ be a parameter such that,

[TABLE]

then for all $s\in\mathcal{S}$ in $\mathcal{M}^{\prime}$ ,

[TABLE]

Proof.

We begin by expanding (54),

[TABLE]

Now combining (57) and (60),

[TABLE]

where the second last step follows from Property 1. With this we conclude the proof. ∎

Appendix D: Adaptive Variance Mitigation

Property 2 (Unbiased estimator).

Let $\hat{J}(s,\alpha,a,\theta)\coloneqq\psi^{\theta}(s,\alpha,a)\left(q^{\theta}(s,a)+\lambda_{1}\hat{v}(s)+\lambda_{2}\bar{q}(s,\alpha)\right)$ and $d^{\pi}(s)\coloneqq(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\Pr(S_{t}=s)$, then for any values of $\lambda_{1}\in\mathbb{R}$ and $\lambda_{2}\in\mathbb{R}$,

[TABLE]

Proof.

We begin by expanding $\nabla J(\theta)$ ,

[TABLE]

Now consider the term associated with the baselines $\hat{v}(s)$ and $\bar{q}$ ,

[TABLE]

Focusing only on the right part of (68),

[TABLE]

Combining (68) and (73), we observe that the bias of this new baseline combination is zero and we get the desired result. ∎
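The key step can be summarized in one line. As a sketch, write $b(s,\alpha)\coloneqq\lambda_{1}\hat{v}(s)+\lambda_{2}\bar{q}(s,\alpha)$ (a shorthand introduced only for this illustration) and treat it as a fixed coefficient evaluated at the current $\theta$. Because $b$ does not depend on the chosen action, it factors out of the sum over $a\in\alpha$:

```latex
\begin{align*}
\sum_{a\in\alpha}\pi^{\theta}(s,\alpha,a)\,\psi^{\theta}(s,\alpha,a)\,b(s,\alpha)
&= b(s,\alpha)\sum_{a\in\alpha}\frac{\partial\pi^{\theta}(s,\alpha,a)}{\partial\theta} \\
&= b(s,\alpha)\,\frac{\partial}{\partial\theta}\sum_{a\in\alpha}\pi^{\theta}(s,\alpha,a)
 = b(s,\alpha)\,\frac{\partial}{\partial\theta}\,1
 = 0 .
\end{align*}
```

The first equality uses $\pi^{\theta}\psi^{\theta}=\partial\pi^{\theta}/\partial\theta$, which follows from the definition $\psi^{\theta}(s,\alpha,a)\coloneqq\partial\log\pi^{\theta}(s,\alpha,a)/\partial\theta$.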

Lemma 4 (Adaptive variance mitigation).

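Let $\mathbf{A}=[\lambda_{1},\lambda_{2}]^{\top},$ $\mathbf{B}=[\psi^{\theta}(s,\alpha,a)\hat{v}(s),\psi^{\theta}(s,\alpha,a)\bar{q}(s,\alpha)]^{\top},$ and $\mathbf{C}=[\psi^{\theta}(s,\alpha,a)q^{\theta}(s,a)]^{\top}$, such that $\mathbf{A}\in\mathbb{R}^{2\times 1},\mathbf{B}\in\mathbb{R}^{d\times 2}$ and $\mathbf{C}\in\mathbb{R}^{d\times 1}$, then the $\mathbf{A}$ that minimizes the variance of $\hat{J}$ is given by,

$$\mathbf{A}=-\left(\mathbb{E}\left[\mathbf{B}^{\top}\mathbf{B}\right]\right)^{-1}\mathbb{E}\left[\mathbf{B}^{\top}\mathbf{C}\right].$$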

Proof.

Let the sample estimate for the gradient be given by,

[TABLE]

We aim to find the values of $\lambda$ that minimize the variance of this estimator, i.e.,

[TABLE]

The variance of the estimator can be computed as follows,

[TABLE]

From Property 2 we know that,

[TABLE]

Expanding (82) in the matrix notations,

[TABLE]

Since the first and last terms from (87) are independent of $\mathbf{A}$, they do not affect the optimization step. The remaining terms that matter are,

[TABLE]

Differentiating these terms with respect to $\mathbf{A}$, and equating the result to zero, we get,

[TABLE]

∎
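For reference, the final optimization step can be sketched compactly. This assumes $\hat{J}=\mathbf{C}+\mathbf{B}\mathbf{A}$, that the mean of $\hat{J}$ does not depend on $\mathbf{A}$ (Property 2), and that $\mathbb{E}[\mathbf{B}^{\top}\mathbf{B}]$ is invertible. The symbol $f$ is introduced only for this illustration.

```latex
\begin{align*}
f(\mathbf{A}) &\coloneqq \mathbb{E}\left[\mathbf{A}^{\top}\mathbf{B}^{\top}\mathbf{B}\mathbf{A}
  + 2\,\mathbf{A}^{\top}\mathbf{B}^{\top}\mathbf{C}\right], \\
\frac{\partial f}{\partial\mathbf{A}}
  &= 2\,\mathbb{E}\left[\mathbf{B}^{\top}\mathbf{B}\right]\mathbf{A}
   + 2\,\mathbb{E}\left[\mathbf{B}^{\top}\mathbf{C}\right] = 0
\;\Longrightarrow\;
\mathbf{A} = -\left(\mathbb{E}\left[\mathbf{B}^{\top}\mathbf{B}\right]\right)^{-1}
             \mathbb{E}\left[\mathbf{B}^{\top}\mathbf{C}\right].
\end{align*}
```

Since $\mathbb{E}[\mathbf{B}^{\top}\mathbf{B}]$ is positive semi-definite, $f$ is convex and this stationary point is a minimizer.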

Appendix E: SAS Natural Policy Gradient

Pseudo-code for SAS natural policy gradient is provided in Algorithm 2. Let the learning rates for updating $\theta$ and $w$ be given by $\eta_{\theta}$ and $\eta_{w}$, respectively. Similar to Algorithm 1, we first collect the transition batch $\mathbb{B}$ and compute the sampled returns from each state in Lines $2$ and $3$. Following Lemma 3, we update the parameter $w$ in Line $5$ to minimize its associated TD error. The updated parameter $w$ is then used to update the policy parameters $\theta$. As dividing by a positive scalar does not change the direction of the (natural) gradient, we normalize the update using the norm of $w$ in Line $6$ for better stability.
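The normalized policy update can be sketched as follows, assuming the update takes the form $\theta\leftarrow\theta+\eta_{\theta}\,w/\lVert w\rVert$. The function name and the small `eps` guard against division by zero are assumptions of this example and are not taken from Algorithm 2.

```python
import numpy as np

def normalized_natural_gradient_step(theta, w, eta_theta, eps=1e-8):
    """Step along w rescaled to (approximately) unit norm."""
    theta = np.asarray(theta, dtype=float)
    w = np.asarray(w, dtype=float)
    # Dividing by a positive scalar keeps the direction of w unchanged.
    return theta + eta_theta * w / (np.linalg.norm(w) + eps)

theta = np.zeros(3)
w = np.array([3.0, 0.0, 4.0])  # hypothetical natural-gradient estimate, norm 5
theta_new = normalized_natural_gradient_step(theta, w, eta_theta=0.1)
```

With this normalization the step length is governed by $\eta_{\theta}$ alone, regardless of the scale of $w$.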

Appendix F: Empirical Analysis Details

Implementation details

Policy parameterization.

To make the policy handle stochastic action sets, we make use of a mask which indicates the available actions. Formally, let $\phi(s)\in\mathbb{R}^{d}$ be the feature vector of the state and let $\theta\in\mathbb{R}^{d\times|\mathcal{B}|}$ denote the parameters that project the features on the space of all actions. Let $y\coloneqq\phi(s)^{\top}\theta$ denote the scores for each action and let $\mathds{1}_{\{a\in\alpha\}}$ be the indicator variable denoting whether the action $a$ is in the available action set $\alpha$ or not. The probability of choosing an action is then computed using the masked softmax, i.e.,

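$$\pi^{\theta}(s,\alpha,a)\coloneqq\frac{\mathds{1}_{\{a\in\alpha\}}\,e^{y_{a}}}{\sum_{a^{\prime}\in\mathcal{B}}\mathds{1}_{\{a^{\prime}\in\alpha\}}\,e^{y_{a^{\prime}}}},$$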

where $y_{a}$ corresponds to the score of action $a$ in $y$ .
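A minimal sketch of such a masked softmax, assuming linear scores $y=\phi(s)^{\top}\theta$ as described in the text. The function name, the max-subtraction for numerical stability, and the example numbers are choices made for this illustration.

```python
import numpy as np

def masked_softmax(scores, available):
    """Softmax over the available actions; unavailable actions get probability zero."""
    scores = np.asarray(scores, dtype=float)
    available = np.asarray(available, dtype=bool)
    if not available.any():
        raise ValueError("at least one action must be available")
    # Subtract the largest available score for stability; masked entries become -inf.
    shifted = np.where(available, scores - scores[available].max(), -np.inf)
    weights = np.exp(shifted)  # exp(-inf) evaluates to 0.0
    return weights / weights.sum()

# Hypothetical example: 2 state features, 3 base actions, the second is unavailable.
phi = np.array([1.0, 0.5])
theta = np.array([[0.2, -0.1, 0.4],
                  [0.0, 0.3, -0.2]])
y = phi @ theta
probs = masked_softmax(y, [True, False, True])
```

Masking the scores before normalizing, instead of zeroing probabilities afterwards, keeps the output a valid distribution over the sampled action set.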

Hyperparameter settings.

For the maze domain, state features were represented using $3^{\text{rd}}$ order coupled Fourier basis (?). For the San Francisco map domain, one-hot encoding was used to represent each of the nodes (states) in the road network. For the recommender system domain, the user context provided by the environment was directly used as state features. Using these features, single-layer neural networks were used to represent the policy, baselines and the q-function for all the algorithms, for all the domains. The discounting parameter $\gamma$ was set to $0.99$ for all the domains.

For SAS policy gradient, the learning rates for both the baselines were searched over $[1e-2,1e-4]$ . The learning rate for policy was searched over $[5e-3,5e-5]$ . The hyper-parameter $\eta_{\lambda}$ was kept fixed to $0.999$ throughout. For SAS natural policy gradient, the learning rate, $\eta_{w}$ , was searched over $[1e-2,1e-4]$ .

For the SAS-Q-learning baseline, the exploration parameter for $\epsilon$-greedy was searched over $[0.05,0.15]$ and the learning rate for the q-function was searched over $[1e-2,1e-4]$. To encompass both online and batch learning for SAS-Q-learning, an additional hyperparameter search was done over the batch sizes $\{1,8,16\}$ and the number of batches $\{1,8,16\}$ per update to the q-function. Note that when both the batch size and the number of batches are $1$, it becomes the online version (?).

In total, $1000$ settings for each algorithm, for each domain, were uniformly sampled from the mentioned hyper-parameter ranges/sets. Results from the best performing setting are reported in all the plots. Each hyper-parameter setting was run using $30$ different seeds to get the standard deviation of the performance.
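The random search described above could be set up along these lines. This is a sketch under stated assumptions: the dictionary keys and function names are hypothetical, the two baseline learning rates are drawn independently, and the ranges are sampled uniformly on a linear scale as the text literally states (a log-uniform draw would be an alternative).

```python
import random

def sample_sas_pg_setting(rng):
    """Draw one hypothetical SAS-PG configuration from the stated ranges."""
    return {
        "lr_v_hat": rng.uniform(1e-4, 1e-2),
        "lr_q_hat": rng.uniform(1e-4, 1e-2),
        "lr_policy": rng.uniform(5e-5, 5e-3),
        "eta_lambda": 0.999,  # kept fixed throughout
    }

def sample_sas_q_setting(rng):
    """Draw one hypothetical SAS-Q-learning configuration from the stated ranges/sets."""
    return {
        "epsilon": rng.uniform(0.05, 0.15),
        "lr_q": rng.uniform(1e-4, 1e-2),
        "batch_size": rng.choice([1, 8, 16]),
        "num_batches": rng.choice([1, 8, 16]),
    }

rng = random.Random(0)
pg_settings = [sample_sas_pg_setting(rng) for _ in range(1000)]
q_settings = [sample_sas_q_setting(rng) for _ in range(1000)]
```

Each sampled setting would then be run with several seeds and the best performing one reported.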

Additional Experimental Results

In Figures 5 and 6 we report the learning curves and the adapted $\lambda_{1}$ and $\lambda_{2}$ values for all the domains under different probability values of action availability.

Bibliography

Amari, S.-i., and Nagaoka, H. 2007. Methods of information geometry, volume 191. American Mathematical Soc.
Amari, S.-I. 1998. Natural gradient works efficiently in learning. Neural computation 10(2):251–276.
Bagnell, J. A., and Schneider, J. G. 2003. Covariant policy search. In IJCAI-03, Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence.
Baird, L. 1995. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995. Elsevier. 30–37.
Bertsekas, D. P., and Tsitsiklis, J. N. 2000. Gradient convergence in gradient methods with errors. SIAM Journal on Optimization 10(3):627–642.
Bhatnagar, S.; Ghavamzadeh, M.; Lee, M.; and Sutton, R. S. 2008. Incremental natural actor-critic algorithms. In Advances in neural information processing systems, 105–112.
Boutilier, C.; Cohen, A.; Daniely, A.; Hassidim, A.; Mansour, Y.; Meshi, O.; Mladenov, M.; and Schuurmans, D. 2018. Planning and learning with stochastic action sets. In IJCAI.
Degris, T.; Pilarski, P. M.; and Sutton, R. S. 2012. Model-free reinforcement learning with continuous action in practice. In Proceedings of the 2012 American Control Conference.
