Truly Batch Apprenticeship Learning with Deep Successor Features

Donghun Lee; Srivatsan Srinivasan; Finale Doshi-Velez

arXiv:1903.10077·cs.LG·March 26, 2019

TL;DR

This paper presents a new batch apprenticeship learning algorithm that learns an expert's reward structure using only observed data, without requiring a model or additional data collection, and demonstrates its effectiveness on benchmarks and clinical tasks.

Contribution

The paper introduces Deep Successor Feature Networks and a transition-regularized imitation network for off-policy batch apprenticeship learning, enabling reward inference without a dynamics model.

Findings

01. Achieves superior results on control benchmarks.

02. Successfully applied to sepsis management in the ICU.

03. Outperforms existing methods in batch settings.

Tables

Table 1: Action Matching Probability. We measured the proportion of cases in which the policy’s predicted action fell in the same discrete bin as the action empirically taken by clinicians. Performance was measured on the test dataset over three trials. Top-1 matching checks whether the policy’s best action matches the clinician’s action; Top-3 matching checks whether the clinician’s action is among the policy’s top 3 choices.

Method	Top-1 matching	Top-3 matching
DSFN	$79 \pm 5\%$	$90 \pm 3\%$
LSTD-mu	$39 \pm 4\%$	$69 \pm 3\%$
SCIRL	$36 \pm 5\%$	$61 \pm 4\%$
Random	$20 \pm 1\%$	$49 \pm 6\%$
IL (not regularized)	$29 \pm 5\%$	$58 \pm 4\%$

Table 2: Sepsis, Action Matching Probability. We measured the proportion of cases in which the policy’s predicted action fell in the same discrete bin as the action empirically taken by clinicians. Performance was measured on the test dataset over three trials. Top-1 matching checks whether the policy’s best action matches the clinician’s action; Top-3 matching checks whether the clinician’s action is among the policy’s top 3 choices.

Method	Top-1 matching	Top-3 matching
TRIL (regularized)	$80 \pm 2\%$	$91 \pm 1\%$
IL (not regularized)	$29 \pm 5\%$	$58 \pm 4\%$
TRIL + DSFN	$79 \pm 5\%$	$90 \pm 3\%$

Table 3: Benchmark Environments. State and action space dimensions for the OpenAI Gym and Sepsis benchmarks.

Environment	dim(s)	dim(a)
MountainCar-v0	2	3
Cartpole-v0	4	2
Acrobot-v1	6	3
Sepsis	46	5

Table 4: The Hyperparameters of Neural Networks. To train the neural networks, we split the demonstration data into a training set (70%) and a validation set (30%). For the policy network, we found it helpful to establish an isotropic multivariate Gaussian output layer where we output its mean with variable standard deviations for the next state prediction.

Hyperparameters	TRIL	DSFN	DQN
number of hidden layers	2	2	2
hidden node size	128	64	128
max training iterations	50000	50000	30000
activation function	tanh	tanh	tanh
optimizer	Adam	Adam	Adam
adam epsilon	1e-4	1e-4	1e-4
adam learning rate	3e-4	3e-4	3e-4
mini-batch size	64	32	64
$λ$ (regularization)	1.4	-	-
state normalizer	Y	Y	N
prioritized experience replay	N	N	Y
prioritized experience replay alpha	-	-	0.6
prioritized experience replay beta0	-	-	0.9
moving average for target network	-	0.01	0.01
discount rate	0.99	0.99	0.99
stopping condition (validation)	5e-3	5e-3	1e-2

Equations

\begin{split}\mu^{\pi}(s_{0},a_{0})&=\phi(s_{0},a_{0})+\mathbb{E}_{\pi}\biggl{[}\sum_{t=1}^{\infty}\gamma^{t}\phi(s_{t},a_{t}\sim\pi)\biggr{]}\\ \mu^{\pi}&=\mathbb{E}_{\mathcal{S}}\biggl{[}\mu^{\pi}(s_{0}\sim\mathcal{S},a_{0}\sim\pi)\biggr{]}\end{split}

y^{\pi}_{(s,a,s^{\prime})}=\begin{cases}\phi(s,a)&\text{if }s^{\prime}=s_{T}\\ \phi(s,a)+\gamma\,\mathbb{E}_{a^{\prime}}\big[\mu_{\theta}^{\pi}(s^{\prime},a^{\prime})\big]&\text{otherwise}\end{cases}

\begin{split}\mathcal{L}(\theta,\pi)&=\frac{1}{2}\mathbb{E}_{(s,a,s^{\prime})\sim D_{e}}\big{[}\,\,||\mu_{\theta}^{\pi}(s,a)-y^{\pi}_{(s,a,s^{\prime})}||^{2}\,\,\big{]}\\ &\approx\frac{1}{N_{D_{e}}}\sum_{i=1}^{N_{D_{e}}}||\mu_{\theta}^{\pi}(s_{i},a_{i})-y^{\pi}_{(s_{i},a_{i},s_{i}^{\prime})}||^{2}\\ \nabla_{\theta}\mathcal{L}(\theta,\pi)&=\mathbb{E}_{(s,a,s^{\prime})\sim D_{e}}\big{[}\,\,(\mu_{\theta}^{\pi}(s,a)-y^{\pi}_{(s,a,s^{\prime})})\cdot\nabla\mu_{\theta}^{\pi}(s,a)\,\,\big{]}\end{split}

w_{(i)}=\underset{w\in\mathbb{R}^{d_{w}}}{\min}\,\|w\|^{2}\quad\text{s.t.}\quad w^{\top}\mu^{\pi}_{j}\leq w^{\top}\mu^{\pi}_{e}+1,\ \forall j\in\{1,2,\dots,(i-1)\}

L(\theta_{\pi_{0}})=L_{\text{ce}}(a,\pi_{0}(s))+\lambda L_{\text{mse}}(T_{\pi_{0}}(s,a),s^{\prime})

Full text

Truly Batch Apprenticeship Learning with Deep Successor Features

Donghun Lee*, Srivatsan Srinivasan*, Finale Doshi-Velez

Harvard University

{donghunlee,srivatsanstinivasan}@g.harvard.edu, [email protected]

* Both authors contributed equally to this work.

Abstract

We introduce a novel apprenticeship learning algorithm to learn an expert’s underlying reward structure in off-policy model-free batch settings. Unlike existing methods that require a dynamics model or additional data acquisition for on-policy evaluation, our algorithm requires only the batch data of observed expert behavior. Such settings are common in real-world tasks (health care, finance or industrial processes) where accurate simulators do not exist or data acquisition is costly. To address challenges in batch settings, we introduce Deep Successor Feature Networks (DSFN) that estimate feature expectations in an off-policy setting and a transition-regularized imitation network that produces a near-expert initial policy and an efficient feature representation. Our algorithm achieves superior results in batch settings on both control benchmarks and a vital clinical task of sepsis management in the Intensive Care Unit.

1 Introduction

Reward design is a key challenge in Reinforcement Learning (RL). Manually identifying an appropriate reward is often difficult, and poorly specified rewards could lead to serious safety threats Leike et al. (2017). Apprenticeship learning is the process of learning how to act from expert demonstrations. To achieve this, Imitation Learning (IL) algorithms, e.g. Ho and Ermon (2016), directly seek to learn a policy from these demonstrations. However, directly learning a policy can be brittle in cases of long-horizon planning Piot et al. (2013) and in environments with strong covariate shifts and dynamics shifts Fu et al. (2017). In contrast, Inverse Reinforcement Learning (IRL) approaches Abbeel and Ng (2004) aim to recover the expert policy by learning the expert’s underlying reward function and are often more robust. Explicitly learning the expert’s reward function can also inform what the expert wishes to achieve, rather than simply what they are reacting to, enabling agents to understand and generalize these “intentions” when encountering similar environments.

In this work, we focus on IRL in batch settings: we must infer the reward function that the expert had optimized for, given only a fixed collection of expert demonstrations. Performing analyses on batch data is desirable and often, the only viable alternative in domains such as health care, finance, education, and industrial automation—situations in which pre-collected logs of expert behavior are relatively plentiful but new data acquisition or a policy roll-out is costly or risky.

While there exist many algorithms for IRL in on-policy settings Abbeel and Ng (2004); Ziebart et al. (2008); Fu et al. (2017), IRL in batch settings has additional challenges. Core to many IRL algorithms is the notion of feature expectation, or the expected cumulative “feature visits” induced by a policy on a given feature space. Assuming a well-engineered feature space, the difference between feature expectations from a candidate policy and an expert policy can be used to improve the estimate of the expert reward function Abbeel and Ng (2004). In non-batch settings, feature expectations of any proposed candidate policy are computed on transitions collected from its on-policy roll-outs. However, in truly batch settings, we have neither an explicit transition dynamics model nor any ability to acquire new data via on-policy roll-outs. Thus, feature expectations must be estimated off-policy, and poor estimates would lead to poor reward updates, rendering IRL ineffective. We also expect batch data to be relatively limited in size and to cover a narrow portion of the state-action space; hence, any off-policy estimation algorithm that is sensitive to the distribution of data is expected to generate poor evaluations in truly batch settings.

Our work makes two key contributions that make truly batch IRL viable. First, we introduce a model parametrized by a neural network that estimates feature expectations in a completely off-policy setting, which we term Deep Successor Feature Network (DSFN). Secondly, we introduce Transition Regularized Imitation Learning (TRIL) that warm-starts our IRL algorithm with an effective feature representation and a near-expert policy to ensure that candidate policies evaluated by DSFN are not far-off from the expert policy which yielded our batch data. To our knowledge, our work is the first to provide an effective IRL algorithm that scales well across both simple (e.g. control) and complicated (e.g. clinical treatment) environments in completely batch, off-policy, model-free settings. We demonstrate the effectiveness of our method in benchmark control tasks such as Mountaincar, Cartpole and Acrobot and in a vital clinical problem of managing Sepsis in the Intensive Care Unit.

2 Related Work

Most IRL techniques fall into one of two categories: margin-based optimization Abbeel and Ng (2004) or probabilistic optimization Ziebart et al. (2008). In this work, we adopt margin-based optimization, which relies on feature expectations, though all our ideas could be adapted for probabilistic-optimization approaches as well.

Feature Expectations and batch IRL

Most IRL works until now have assumed access to a simulator to perform on-policy rollouts Abbeel and Ng (2004); Ziebart et al. (2008); Fu et al. (2017), and relatively few works have considered IRL in a truly batch setting. Like our work, Klein et al. (2011) view estimating feature expectations as a policy evaluation problem. Their work proposes Least-Squares Temporal Difference (LSTD) methods and thus inherits the common weaknesses of least-squares estimators: a high sensitivity to basis features and to the distribution of training data Lagoudakis and Parr (2003). Klein et al. (2013) proposed Structured Classification IRL (SCIRL), which optimizes the reward by using the action-value function as the score metric of a multi-class classification problem. While it is simple in formulation, it still requires feature expectations to be estimated in model-free settings via LSTD methods. In contrast to these LSTD-based methods for batch settings, our model uses the representation power of neural networks and prioritized experience replay Schaul et al. (2015) in our DSFN to perform off-policy estimation of feature expectations more effectively.

Warm-Starting IRL with features and Initial Policy

In general, learning a good feature space is instrumental in the success of any IRL algorithm, and experts may not always be able to comprehensively specify features characterizing an environment Levine et al. (2010). Attempts to learn rich basis features without manual engineering have been made, for instance using hidden layers of neural networks as latent feature encoders Jin et al. (2015). Our model is built along similar lines: it uses a TRIL network whose hidden layers automatically provide a feature transformer for the state-action inputs that are fed into the IRL loop. While imitation learning has evolved largely as a non-RL analogue to IRL for learning from expert demonstrations Ross and Bagnell (2010); Ho and Ermon (2016), works such as Piot et al. (2014) showed the theoretical connections between IRL and IL and proposed a unification framework to help combine advances in these two previously independent domains. Also, other salient IRL works such as Fu et al. (2017) have observed the benefits of warm-starting IRL policies with supervised learning. In a similar vein, our TRIL network learns a good initial policy to warm-start IRL, an indispensable step in batch settings since we have data collected only from the expert policy (details in Section 4).

3 Background

A. Markov Decision Process :

An MDP is a 5-tuple $(S,A,T,R,\gamma)$ parameterized by (in this work, continuous) states $s\in S$, (discrete) actions $a\in A$, transition probabilities $T(s^{\prime}|s,a)$, the initial state distribution $d(s_{0})$, reward function $R(s,a)$, and discount factor $\gamma\in[0,1)$. A policy $\pi(a|s)$ is a stochastic map that denotes the probability of taking an action $a$ in state $s$. The value function $V^{\pi}(s)=\mathbb{E}_{\pi}[\sum_{t=0}^{T}\gamma^{t}r(s_{t},a_{t})|s_{0}=s]$ and the action-value function $Q^{\pi}(s,a)=R(s,a)+\mathbb{E}_{\pi}[\sum_{t=1}^{T}\gamma^{t}r(s_{t},a_{t})|s_{0}=s,a_{0}=a]$ measure the quality of states and actions under any policy $\pi$. Here, $\mathbb{E}_{\pi}$ refers to the expectation under the transition dynamics induced by $\pi$, i.e. $s_{t+1}\sim T(s_{t+1}|s_{t},a_{t}\sim\pi)$. Finally, $\pi_{e}$ denotes the expert (optimal) policy such that $\pi_{e}=\arg\max_{\pi}V^{\pi}(s),\forall s\in S$.

B. Max-Margin IRL and Feature Expectations :

We assume that we are given $\mathcal{D}=\{(s_{0},a_{0},...,s_{T})\}$, a collection of trajectories sampled according to $\pi_{e}$. In max-margin IRL Abbeel and Ng (2004), we also assume the reward function is linear in some state-action features, $R(s,a)=w^{\top}\cdot\phi(s,a)$, where $\phi(s,a)\in\mathbb{R}^{d}$ is a feature map defined over $S\times A$. The feature expectation $\mu^{\pi}(s,a)$ (also known as a successor feature Barreto et al. (2017)) for a state-action pair under any policy $\pi$ is defined as the expected discounted accumulated “feature visitations” induced by $\pi$. The overall feature expectation $\mu^{\pi}$ is defined as the expected $\mu^{\pi}(s,a)$ over the set of initial states $\mathcal{S}$:

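$$\begin{split}\mu^{\pi}(s_{0},a_{0})&=\phi(s_{0},a_{0})+\mathbb{E}_{\pi}\biggl[\sum_{t=1}^{\infty}\gamma^{t}\phi(s_{t},a_{t}\sim\pi)\biggr]\\ \mu^{\pi}&=\mathbb{E}_{\mathcal{S}}\biggl[\mu^{\pi}(s_{0}\sim\mathcal{S},a_{0}\sim\pi)\biggr]\end{split}\tag{1}$$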

If the reward function is linear in $\phi$, i.e. $R(s,a)=w^{T}\cdot\phi(s,a)$, the convergence of our agent’s feature expectations $\mu^{\pi}$ to the expert’s feature expectations $\mu^{\pi}_{e}$ is a sufficient condition for learning a reward structure whose optimal policy matches the expert’s policy Abbeel and Ng (2004).
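To make the role of feature expectations concrete, the following is a minimal sketch (not the authors' code) of the empirical estimate that is available for the expert from recorded trajectories, together with a check that a linear reward turns the average discounted return into $w^{\top}\mu$. The feature map, weights and toy trajectories are hypothetical values chosen only for illustration. For candidate policies in the batch setting this Monte Carlo estimate is not available, because no new trajectories can be rolled out.

```python
def feature_expectation(trajectories, phi, gamma):
    """Monte Carlo estimate of mu: mean discounted feature sum over trajectories.

    trajectories: list of trajectories, each a list of (state, action) pairs.
    phi: feature map returning a list of floats (hypothetical, for illustration).
    """
    dim = len(phi(*trajectories[0][0]))
    mu = [0.0] * dim
    for traj in trajectories:
        for t, (s, a) in enumerate(traj):
            discount = gamma ** t
            for k, f in enumerate(phi(s, a)):
                mu[k] += discount * f
    return [m / len(trajectories) for m in mu]


def discounted_return(traj, reward, gamma):
    """Discounted sum of rewards along one trajectory."""
    return sum(gamma ** t * reward(s, a) for t, (s, a) in enumerate(traj))


# Toy setup: every value below is an illustrative assumption.
def phi(s, a):
    return [float(s), float(a), 1.0]


w = [0.5, -1.0, 0.1]


def reward(s, a):
    # Linear reward R(s, a) = w^T phi(s, a).
    return sum(wk * fk for wk, fk in zip(w, phi(s, a)))


trajs = [[(0, 1), (1, 0), (2, 1)], [(1, 1), (3, 0)]]
gamma = 0.9

mu = feature_expectation(trajs, phi, gamma)
value_from_mu = sum(wk * mk for wk, mk in zip(w, mu))
value_direct = sum(discounted_return(t, reward, gamma) for t in trajs) / len(trajs)
```

Because the reward is linear in the features, `value_from_mu` and `value_direct` agree up to floating-point error, which is why matching feature expectations is enough to match values under any weight vector $w$.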

4 Method: IRL with Deep Successor Features

While our batch IRL framework is not restricted to one particular IRL algorithm, we adopt max-margin Apprenticeship Learning Abbeel and Ng (2004) as our IRL algorithm in this work. (Note that even the more recent IRL procedures such as adversarial IRL Fu et al. (2017) cannot function without on-policy rollouts to evaluate candidate policies. Future work would involve extending our ideas to more complicated IRL algorithms.) In such max-margin algorithms (Algorithm 1), computing the feature expectations (line 2) is a key step to evaluate candidate policies. Most max-margin IRL approaches Abbeel and Ng (2004); Ratliff et al. (2006) assume an ability to perform on-policy roll-outs (using simulators) or knowledge of the model dynamics to collect additional data, both unavailable in batch settings. In this work, our primary aim is to tackle this inability to perform on-policy rollouts (to evaluate policies) and not to introduce any advancements over IRL algorithms that are already successful in non-batch settings.

Inspired by the linear least-squares approach of Klein et al. (2011) to estimate $\mu^{\pi}$ , we interpret the problem of estimating feature expectations in batch settings as an off-policy evaluation problem, drawing a parallel between the feature expectations $\mu^{\pi}$ (Equation 1) as cumulative feature visits and the action value function $Q^{\pi}(s,a)$ (Section 3 A.) as cumulative rewards under a policy $\pi$ . This parallel allows us to leverage advances in off-policy action-value function approximation for feature expectation estimation and thus, in Section 4.1, we introduce Deep Successor Feature Networks (DSFN) as an analogue to Deep-Q networks Mnih et al. (2015) in the feature space.
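The parallel between $Q^{\pi}$ and $\mu^{\pi}$ can be written out in two lines. This is a short derivation under the linear-reward assumption $R(s,a)=w^{\top}\phi(s,a)$ and an infinite horizon (as in the definition of $\mu^{\pi}$); the explicit sampling notation for $s^{\prime}$ and $a^{\prime}$ is added here for clarity and is not taken from the original text.

```latex
\begin{aligned}
Q^{\pi}(s,a)
  &= \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t})\,\Big|\,s_{0}=s,a_{0}=a\Big]
   = w^{\top}\,\mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}\phi(s_{t},a_{t})\,\Big|\,s_{0}=s,a_{0}=a\Big]
   = w^{\top}\mu^{\pi}(s,a),\\
\mu^{\pi}(s,a)
  &= \phi(s,a)+\gamma\,\mathbb{E}_{s^{\prime}\sim T(\cdot|s,a),\,a^{\prime}\sim\pi(\cdot|s^{\prime})}\big[\mu^{\pi}(s^{\prime},a^{\prime})\big].
\end{aligned}
```

The second line has the same recursive form as the Bellman equation for $Q^{\pi}$, with the scalar reward replaced by the feature vector, which is what allows temporal-difference methods for action values to be reused for feature expectations.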

4.1 Estimating Feature Expectations via Deep Successor Feature Network (DSFN)

Let $D_{e}=\{(s_{i},a_{i},s^{\prime}_{i})\}_{i=1:N_{D_{e}}}$ denote the batch data sampled using $\pi_{e}$. Define $s_{T}$ as the terminal state. Let $\mu_{\theta}^{\pi}(s,a)$ denote the feature expectation estimator parameterized by a neural network ($\theta$) for an evaluation policy $\pi$. The aim is to learn $\mu_{\theta}^{\pi}(s,a)\approx\mu^{\pi}(s,a),\forall(s,a)$, and the model is trained using the TD errors from the Bellman equation. Given $\pi,\phi$, we set the Bellman targets $\forall(s,a,s^{\prime})\in D_{e}$ as in Equation 2:

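$$y^{\pi}_{(s,a,s^{\prime})}=\begin{cases}\phi(s,a)&\text{if }s^{\prime}=s_{T}\\ \phi(s,a)+\gamma\,\mathbb{E}_{a^{\prime}}\big[\mu_{\theta}^{\pi}(s^{\prime},a^{\prime})\big]&\text{otherwise}\end{cases}\tag{2}$$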

Notice $y^{\pi}_{(s,a,s^{\prime})}$ is specific to $\pi$ and changes with a change in policy. We use mean-square error loss to train our deep successor feature network. For a fixed $\pi,\phi$ , the loss and its gradient $\forall(s,a,s^{\prime})\in D_{e}$ can be calculated as:

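$$\begin{split}\mathcal{L}(\theta,\pi)&=\frac{1}{2}\mathbb{E}_{(s,a,s^{\prime})\sim D_{e}}\big[\,\,||\mu_{\theta}^{\pi}(s,a)-y^{\pi}_{(s,a,s^{\prime})}||^{2}\,\,\big]\\ &\approx\frac{1}{N_{D_{e}}}\sum_{i=1}^{N_{D_{e}}}||\mu_{\theta}^{\pi}(s_{i},a_{i})-y^{\pi}_{(s_{i},a_{i},s_{i}^{\prime})}||^{2}\\ \nabla_{\theta}\mathcal{L}(\theta,\pi)&=\mathbb{E}_{(s,a,s^{\prime})\sim D_{e}}\big[\,\,(\mu_{\theta}^{\pi}(s,a)-y^{\pi}_{(s,a,s^{\prime})})\cdot\nabla\mu_{\theta}^{\pi}(s,a)\,\,\big]\end{split}\tag{3}$$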

The training procedure is exactly analogous to that of deep Q-learning Mnih et al. (2015) with a subtle difference that DSFN does policy evaluation while DQN does policy optimization. Since we can’t collect additional data in batch settings to estimate the performance of DSFN, we carve out a validation dataset and terminate the training when validation loss $\mathcal{L}_{\text{val}}$ converges under a threshold of $\delta>0$ (Algorithm 2).
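The training loop described above can be illustrated with a small sketch. It is not the authors' implementation: a lookup table stands in for the neural network $\mu_{\theta}^{\pi}$, and the target network, experience replay and validation-based stopping are omitted. The function names, the two-state chain and the learning rate are all hypothetical. The sketch keeps the two essential ingredients, namely Bellman targets that take the expectation over the evaluation policy's next action, and a squared-error update toward those targets using only logged transitions.

```python
def bellman_targets(batch, phi, policy, mu, gamma):
    """y = phi(s,a) if s' is terminal, else phi(s,a) + gamma * E_{a'~pi}[mu(s',a')]."""
    targets = []
    for s, a, s_next, is_terminal in batch:
        y = list(phi(s, a))
        if not is_terminal:
            for a_next, prob in enumerate(policy(s_next)):
                for k, m in enumerate(mu[(s_next, a_next)]):
                    y[k] += gamma * prob * m
        targets.append(y)
    return targets


def td_sweep(batch, phi, policy, mu, gamma, lr):
    """One pass over the batch: gradient step on 0.5 * ||mu(s,a) - y||^2 per sample."""
    targets = bellman_targets(batch, phi, policy, mu, gamma)  # fixed during the sweep
    loss = 0.0
    for (s, a, _, _), y in zip(batch, targets):
        est = mu[(s, a)]  # tabular stand-in for the network output
        for k in range(len(est)):
            err = est[k] - y[k]
            loss += 0.5 * err * err
            est[k] -= lr * err  # gradient of the table entry w.r.t. itself is 1
    return loss / len(batch)


# Toy chain (illustrative assumption): state 0 -> state 1 -> terminal state 2.
gamma = 0.9


def phi(s, a):
    # One-hot state indicator as the feature vector.
    return [1.0 if s == 0 else 0.0, 1.0 if s == 1 else 0.0]


def policy(s):
    # Evaluation policy: always picks action 0 out of two actions.
    return [1.0, 0.0]


batch = [(0, 0, 1, False), (1, 0, 2, True)]  # (s, a, s', s' is terminal)
mu = {(s, a): [0.0, 0.0] for s in (0, 1) for a in (0, 1)}

for _ in range(200):
    loss = td_sweep(batch, phi, policy, mu, gamma, lr=0.5)
```

On this chain the table converges to `mu[(1, 0)] = [0, 1]` and `mu[(0, 0)] = [1, 0.9]`, the exact discounted feature visits of the evaluation policy. Note that entries for action 1 are never updated because the batch contains no such transitions, a small-scale version of the coverage problem discussed next.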

Necessity of warm-starting IRL

Notice that the expectation in Eqn. (3) is taken with respect to transitions from $D_{e}\sim\pi_{e}$. This implies that when the candidate policy $\pi$ is significantly different from $\pi_{e}$, the batch data support could be nearly disjoint (i.e. $D\sim\pi,\ D\cap D_{e}\approx\emptyset$). Since one cannot collect additional transitions in batch settings, our gradient updates for $\mu^{\pi}$ would be heavily biased. Consequently, IRL with DSFN may fail to converge. Thus, it is crucial to initialize IRL with a near-expert policy so that $\mu^{\pi}$ can be accurately evaluated on the part of the state-action space seen in $D_{e}$, as opposed to the random policy that most non-batch IRL algorithms typically begin with.

4.2 Warm-starting and Feature Learning via Transition-Regularized Imitation Learning

We propose Transition-Regularized Imitation Learning (TRIL) as a novel batch IL model to obtain a near-expert initial policy while simultaneously deriving a good feature space encoder for the IRL phase. Our TRIL network is a two-channel network jointly trained to predict the expert’s action given state and the system’s next state transition given state and expert action. Other works have shown that combining dynamics and action prediction is useful in a.) learning a good imitation policy Oh et al. (2015) or b.) creating representations that reflect the temporal dynamics of the system Song et al. (2016). In our work, we found that TRIL could be leveraged simultaneously for both engineering an effective feature space and a near-expert initial policy for IRL. Knowing that the joint hidden layers capture key information about expert behavior and system dynamics simultaneously, we use those layers as feature encoders to derive corresponding feature representations $\phi$ for input states in IRL. Also, the policy output by TRIL is fed as $\pi_{0}$ to warm-start Algorithm 1.

The training procedure of TRIL is similar to that of a multi-channel supervised classifier with regularization. Let $\theta_{\pi_{0}}$ be the parameters of TRIL, $L_{\text{ce}}$ the cross-entropy loss for predicting the expert’s action, and $L_{\text{mse}}$ the mean squared error loss for predicting the next state given the current state and the expert’s action, with samples drawn from the demonstration data $D_{e}$. Let $\lambda$ be the coefficient that controls the strength of the regularization. The network is trained using the following loss for all $(s,a,s^{\prime})\in D_{e}$:

$$L(\theta_{\pi_{0}})=L_{\text{ce}}(a,\pi_{0}(s))+\lambda L_{\text{mse}}(T_{\pi_{0}}(s,a),s^{\prime})$$

Figure 1 presents the full schematic flow of our model that demonstrates the interplay between TRIL and IRL with DSFN. Notice that TRIL learns $\phi(s)$ which can easily be extended to compose $\phi(s,a)$ for a discrete action problem by concatenating one-hot encodings of the actions.
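As an illustration of the two pieces described here, the sketch below computes the TRIL objective for a single sample (cross-entropy on the expert action plus $\lambda$ times the mean squared error on the next state) and builds $\phi(s,a)$ by appending a one-hot action encoding to $\phi(s)$. It is a sketch under stated assumptions, not the authors' network: the logits, predicted next state and encoded state are passed in as plain lists rather than produced by a trained model, and all function names are hypothetical. The value $\lambda = 1.4$ is the one listed in Table 4.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of action logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tril_loss(action_logits, expert_action, predicted_next_state, next_state, lam):
    """L = L_ce(a, pi_0(s)) + lam * L_mse(T(s, a), s') for one (s, a, s') sample."""
    probs = softmax(action_logits)
    l_ce = -math.log(probs[expert_action])
    l_mse = sum((p - t) ** 2 for p, t in zip(predicted_next_state, next_state))
    l_mse /= len(next_state)
    return l_ce + lam * l_mse


def state_action_features(phi_s, action, num_actions):
    """Compose phi(s, a) by concatenating phi(s) with a one-hot action vector."""
    one_hot = [1.0 if i == action else 0.0 for i in range(num_actions)]
    return list(phi_s) + one_hot


# Illustrative values only.
loss_perfect_dynamics = tril_loss([0.0, 0.0], 0, [1.0, 2.0], [1.0, 2.0], lam=1.4)
loss_poor_dynamics = tril_loss([0.0, 0.0], 0, [0.0, 0.0], [1.0, 1.0], lam=1.4)
features = state_action_features([0.3, 0.7], action=1, num_actions=3)
```

With uniform logits over two actions the cross-entropy term is $\log 2$, so the first loss equals $\log 2$ and the second equals $\log 2 + 1.4$, showing how the transition term penalizes a representation that predicts the expert's action without capturing the dynamics.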

5 Experimental Procedure

Training Details

For our DSFN model, we first trained a TRIL network for warm start. We used a 70-30 training-validation split and, following Duan et al. (2016), included a Gaussian output layer that learns the means and standard deviations for the transition prediction, which is necessary to capture the uncertainty in our highly stochastic clinical domain experiment (Section 7). Further training details in terms of TRIL, DSFN model architecture and hyperparameters are provided in the appendix (Table 4). The IRL update was computed with the max-margin algorithm Abbeel and Ng (2004) (Algorithm 1).

Baselines

We considered two baselines which, to our knowledge, are the only IRL algorithms that are well-designed to operate in completely batch settings. The LSTD-$\mu$+LSPI baseline Klein et al. (2011) uses Least Squares Temporal Difference (LSTD), a linear model, to approximate estimates of feature expectations ($\mu^{\pi}$), and then Least Squares Policy Iteration (LSPI) as the MDP solver. For training the baselines, wherever possible we used the training procedure and model settings provided in the authors’ open source implementation (https://github.com/edouardklein/RL-and-IRL). The SCIRL baseline Klein et al. (2012) uses estimated feature expectations as a parameterization of the score function of a multi-class classifier (to predict actions). The parameter vector computed this way defines the reward function of the environment and does not require repeatedly solving the RL problem. To make the comparisons fair, all the algorithms compared were initialized with the same initial policy and feature space (using TRIL).

6 Results: Control Benchmarks

We considered three standard benchmarks: MountainCar-v0, CartPole-v0 and Acrobot-v1 (https://github.com/openai/gym). In all cases, the optimal policy was first obtained via on-line learning with a DQN Mnih et al. (2015). This policy was used to generate demonstration data with a varying number of episodes (1, 10, 100, and 1000) as training batch data input. Once the data was collected, we no longer accessed the simulator or collected any additional data for the entirety of our process (IL and/or IRL). Thus, the experiments conformed to the truly batch, model-free setting.

Our DSFN approach outperformed the baselines with a greater sample-efficiency.

Figure 2 shows the results of our experiments for the three chosen control tasks. Across all tasks and all data regimes, the DSFN model (our model initialized with TRIL) outperforms the baselines, reaching near-expert performance with an order of magnitude less data. We observed that LSTD-$\mu$ performed poorly because of its strong dependence on the coverage and distribution of the input data; with poor coverage, the least-squares system becomes under-determined. We found SCIRL training to be less reliable because it still depends on LSTD methods and shares the same issues. In addition, it was hard to fine-tune the hyper-parameters that constitute SCIRL’s key heuristic.

Our DSFN approach recovers rewards whose optimal policies match experts similarly or better than imitation learning.

In Figure 2, we also compare all the IRL approaches, whose goal is to recover the reward function, to a pure imitation learner (imitator-TRIL). We see that the baselines lag behind the imitation learner, while DSFN matches or exceeds its performance while doing the much harder and more useful task of recovering the reward function. While imitation learning is not an IRL approach, and thus not a direct competitor to DSFN, we included this comparison because it answers the key question of whether the features and the feature expectation approximation are expressive and robust enough to find rewards that could recover the expert policy as well as traditional supervised learning.

7 Results: Sepsis Management in ICU

Sepsis is a leading cause of cost and mortality in Intensive Care Units (ICU), killing 258,000 Americans every year Mervyn et al. (2016). Recently, Raghu et al. (2017) used Deep RL to optimize fluid and vasopressor intervention strategies for patients with sepsis. In our work, we focus on learning the rewards associated with choices of vasopressor administration from clinical demonstrations, as vasopressors are a critical clinical intervention to counter the sepsis which often leads to acute hypotension Mervyn et al. (2016).

We intend to answer a key question in a complicated problem space: “What are clinicians optimizing for with these vasopressor interventions?” Eliciting a full set of considerations from clinicians is hard, making this an ideal domain for IRL. Understanding their motivations has the potential for building better clinical assistant agents as well as understanding whether “true” clinician goals match their stated goals.

7.1 Problem Set-Up

Expert demonstrations and MDP definition

A cohort of 17,898 patients fulfilling Sepsis-3 criteria was obtained from the Multiparameter Intelligent Monitoring in Intensive Care (MIMIC-III v1.4) database Johnson et al. (2016). Our problem setup is similar to the work of Raghu et al. (2017) which aims to derive optimal policies for sepsis treatment from the available batch data. We model the data comprising 46 features (patient vitals and lab measurements + attributes) including important non-vasopressor interventions such as mechanical ventilation and IV fluids at each time-step as our continuous state space i.e. the vector $\textbf{s}\in\mathbb{R}^{46}$ . We work in a discrete action setting where each action amounted to choosing one among 5 vasopressor dosage bins. We consider in-hospital mortality and leaving the ICU (alive) as the absorbing states (More MDP and feature details can be found in the appendix Sections A.1, A.2).

7.2 Results

Imitation: On a real batch IRL task, our DSFN produces approximately 80% action-matching.

Our sepsis dataset was divided into an 80-20 train-test partition. In Table 1, we see that the DSFN model achieved significantly higher action-matching rates than the other baselines (setup similar to Section 6). LSTD-$\mu$, despite heavy tuning, remained sensitive to the data distribution and failed to recover good feature expectations because our dataset covered only a narrow portion of the state-action space. SCIRL had similar restrictions because of its dependence on LSTD methods. We note that all algorithms were given the same features and warm-start (from TRIL) for a fair comparison of imitation results.
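The action-matching metric reported here can be sketched as follows. This is an illustrative implementation of the Top-1 and Top-3 definitions given in the table caption, not the authors' evaluation code. It assumes the policy exposes one preference score per dosage bin for each test state (for example action values or probabilities); the function name and the toy scores are hypothetical.

```python
def top_k_matching(action_scores, clinician_actions, k):
    """Fraction of test states where the clinician's action is among the
    policy's k highest-scoring actions."""
    hits = 0
    for scores, taken in zip(action_scores, clinician_actions):
        ranked = sorted(range(len(scores)), key=lambda a: scores[a], reverse=True)
        if taken in ranked[:k]:
            hits += 1
    return hits / len(clinician_actions)


# Toy example with 5 dosage bins and 4 test states (illustrative values).
scores = [
    [0.10, 0.50, 0.20, 0.15, 0.05],
    [0.60, 0.04, 0.20, 0.10, 0.06],
    [0.25, 0.15, 0.30, 0.20, 0.10],
    [0.02, 0.08, 0.10, 0.30, 0.50],
]
taken = [1, 2, 1, 3]

top1 = top_k_matching(scores, taken, k=1)  # 0.25
top3 = top_k_matching(scores, taken, k=3)  # 0.75
```

Top-3 matching is always at least as high as Top-1 matching, since the best action is by definition among the top three.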

Interpretability: IRL with DSFN provides insights in line with usual clinical practice.

Action matching is the primary quantitative performance metric to track if we wish to understand whether IRL is finding a reward function consistent with clinical practice. Given that our DSFN model performs relatively well in terms of action-matching, we also focus on the clinical interpretation of the learned rewards in order to verify whether the model mimics how clinicians usually think. Figure 3 shows the learned rewards with respect to three key patient vitals that are usually extreme in patients with septic shock. The dashed line indicates the learned reward for taking no action; the solid line indicates the learned reward for administering a high dose. It is known that patients with sepsis usually suffer from hypotension, high heart rate and low platelet count (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1854939/). We see that our model captures this intuition: it penalizes the agent for taking no action when the patient suffers from low BP, high heart rate, or low platelet count, and conversely rewards the agent for administering a high dosage of vasopressors in such extreme scenarios of septic shock. These patterns were also verified to be sensible by clinical experts.

8 Discussion

IRL in fully batch settings (that is, settings where only a limited, previously collected set of expert demonstrations is available) is challenging: the data has low coverage over the state-action space, making off-policy estimation tricky, and one may also see variation amongst experts (e.g. in clinical settings). Our work is one of the first to identify underlying reward functions that recover the expert policy in real, large-scale batch settings (essential, because otherwise we may be interpreting noise) and then interpret them to start understanding why experts are making the choices they do.

For both the baselines and our TRIL+DSFN, we found that starting with a good initial policy is crucial for the success of batch IRL, especially when the number of expert demonstrations is relatively small. If the initial proposed policy is drastically different from the expert policy, the IRL algorithms do not converge due to errors propagating from the off-policy feature expectation estimates (details in Section 4.1). However, once initialized in the support of the expert trajectories, our IRL loop usually converged in fewer than five iterations (details in appendix), and as seen in our results, our model, TRIL+DSFN, is much more robust in its ability to recover the expert reward function from that warm start. Finally, we note that the warm-start, which produces a policy that closely matches the expert, is not IRL: our goal is not simply to recover the expert’s policy (a straightforward supervised problem) but to recover the expert’s reward function. Thus, when DSFN finds a policy similar to the expert policy, it means that it has found a reward function that produces the expert policy under an MDP instead of purely mimicking it.

While the TRIL+DSFN framework we presented for batch IRL is very generic, certain key modeling and training choices enhanced our overall performance. We believe that setting up a transition-based regularization channel jointly with action prediction (TRIL) had two benefits: (a) learning the dynamics guided the imitation of expert actions in line with the system’s possible transitions, and (b) the hidden layers shared by both channels provided a feature transformation that effectively encoded the decisions and temporal dynamics of the problem. This rich feature space countered the high sensitivity of max-margin IRL to the quality of features (Ratliff et al., 2006). Also, since the sepsis environment is highly stochastic, we wanted our DSFN to be aware of uncertainty estimates for more robust training, which we achieved through Gaussian output layers. We also normalized the states on a rolling basis to provide a consistent range of input values (Henderson et al., 2017).
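The text does not specify how the rolling normalization was implemented. One common option, shown here purely as an illustrative assumption, is to maintain running per-feature means and variances with Welford's online update and to standardize each state with the statistics seen so far. The class name `RunningNormalizer` and the synthetic two-feature states are hypothetical.

```python
import numpy as np

class RunningNormalizer:
    """Running per-feature standardization (Welford's online algorithm).
    Assumed scheme: the paper only says states were normalized on a rolling basis."""

    def __init__(self, dim, eps=1e-8):
        self.count = 0
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)   # running sum of squared deviations
        self.eps = eps

    def update(self, x):
        x = np.asarray(x, dtype=float)
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        var = self.m2 / max(self.count, 1)  # population variance so far
        return (np.asarray(x, dtype=float) - self.mean) / np.sqrt(var + self.eps)

# Synthetic two-feature states, for illustration only.
rng = np.random.default_rng(0)
states = rng.normal(loc=[80.0, 120.0], scale=[10.0, 20.0], size=(500, 2))

norm = RunningNormalizer(dim=2)
for s in states:
    norm.update(s)
z = norm.normalize(states[-1])
```

After all states have been seen, the running statistics agree with the batch mean and variance, so the network receives inputs on a consistent scale regardless of the raw units of each feature.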

Broadly, we introduced three key elements that made batch IRL viable: off-policy estimation (DSFN), a near-expert initial policy, and good feature representations (both from TRIL). Many IL+IRL algorithms could fit within this framework. In the future, beyond TRIL and DSFN, it would be interesting to explore other methods for identifying feature spaces and warm starts, as well as other off-policy methods for computing feature expectations, ranging from model-based (Herman et al., 2016) to importance-sampling-based (Thomas and Brunskill, 2016), each of which will have different bias-variance trade-offs. Finally, we note that our innovations can be combined with other IRL algorithms that use feature expectations, e.g. the entropy-based approaches of Ziebart et al. (2008).
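For reference, the quantity all of these methods estimate is the discounted feature expectation $\mu^{\pi} = \mathbb{E}\left[\sum_t \gamma^t \phi(s_t, a_t)\right]$, in the standard apprenticeship learning formulation of Abbeel and Ng (2004). For the expert, it can be estimated directly by averaging over the observed trajectories, as in the sketch below. The function name and the toy trajectories are illustrative assumptions. For any other policy this direct average is unavailable in the batch setting, which is why an off-policy estimator such as DSFN is needed.

```python
import numpy as np

def empirical_feature_expectation(trajectories, gamma=0.99):
    """Average discounted feature sum over a batch of trajectories.
    Each trajectory is an array of shape (T, d) holding phi(s_t, a_t)."""
    totals = []
    for phi in trajectories:
        phi = np.asarray(phi, dtype=float)
        discounts = gamma ** np.arange(len(phi))  # 1, gamma, gamma^2, ...
        totals.append(discounts @ phi)            # sum_t gamma^t * phi_t
    return np.mean(totals, axis=0)

# Two toy expert trajectories with 2-dimensional features (illustrative only).
expert_trajs = [
    np.array([[1.0, 0.0], [0.0, 1.0]]),
    np.array([[1.0, 1.0]]),
]
mu_expert = empirical_feature_expectation(expert_trajs, gamma=0.5)
```

With a linear reward $r = w^\top \phi$, the value of a policy is $w^\top \mu^{\pi}$, so matching feature expectations with the expert implies matching the expert's value under that reward.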

9 Conclusion

We introduced a truly batch IRL method that combines deep successor features, an imitation-based initialization and smart representation learning to effectively recover the reward functions that underpin expert demonstrations. Overall, our model was data-efficient, computation-friendly and comfortably outperformed the baselines with limited demonstrations. Few IRL approaches exist for a truly batch setting, and to our knowledge, ours is the first to work reliably with limited expert demonstrations in large-scale, chaotic health-care settings. The approach can be extended to vital problems in other domains such as finance, education and industrial automation.

Appendix A Sepsis

Here we share the details for the sepsis management experiment. The features were chosen to represent the most important parameters that clinicians would examine when deciding treatment and dosage for sepsis patients. They can be broadly categorized into the four groups listed in Section A.2.

A.1 Experimental Details

When several data points were present in one window, appropriate statistics (mean or sum) deemed apt by clinicians were used for aggregation. The trajectories of clinical measurements have no “true” state space, so we modeled the data as coming from a continuous state space that consisted of 46 features, including important non-vasopressor interventions such as mechanical ventilation and IV fluids. We consider in-hospital mortality and leaving the ICU (alive) to be absorbing states. Each patient’s treatment trajectory comprises an episode of expert demonstrations for our agent to learn from. Our trajectory lengths are at most 20 steps (about 80 hours of ICU stay, since the data was collated into 4-hour bins). Vasopressor actions were discretized into 5 bins: one bin for no dose and 4 associated with quartiles from the data. We used a discount factor $\gamma$ of 0.99. Our goal was to learn a reward function in this MDP that corresponded to expert behavior.
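The action discretization described above can be sketched as follows. This is a minimal illustration under the assumption that the quartiles are computed over the nonzero doses only; the function names `fit_dose_bins` and `discretize_dose` and the example doses are hypothetical.

```python
import numpy as np

def fit_dose_bins(doses):
    """Interior quartile cut points (25th, 50th, 75th percentiles).
    Assumption: quartiles are taken over nonzero doses only."""
    doses = np.asarray(doses, dtype=float)
    nonzero = doses[doses > 0]
    return np.quantile(nonzero, [0.25, 0.5, 0.75])

def discretize_dose(doses, edges):
    """Map doses to bins 0..4: bin 0 is no dose, bins 1..4 are dose quartiles."""
    doses = np.asarray(doses, dtype=float)
    # A nonzero dose lands in bin 1 + (number of cut points strictly below it).
    bins = 1 + np.searchsorted(edges, doses, side="left")
    return np.where(doses > 0, bins, 0)

# Hypothetical vasopressor doses, one per 4-hour window.
doses = np.array([0.0, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])
edges = fit_dose_bins(doses)
actions = discretize_dose(doses, edges)
```

The cut points should be fitted on the training data and then reused unchanged on the test data, so that a given dose always maps to the same action bin.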

A.2 Patient Features

1. Index Measures: Shock Index, Elixhauser, SIRS, Gender, Re-admission, GCS (Glasgow Coma Scale), Age

2. Lab Values: Albumin, Arterial pH, Calcium, Glucose, Hemoglobin, Magnesium, PTT (Partial Thromboplastin Time), Potassium, SGPT (Serum Glutamic-Pyruvic Transaminase), Arterial Blood Gas, BUN (Blood Urea Nitrogen), Chloride, Bicarbonate, INR (International Normalized Ratio), Sodium, Arterial Lactate, CO2, Creatinine, Ionised Calcium, PT (Prothrombin Time), Platelets Count, SGOT (Serum Glutamic-Oxaloacetic Transaminase), Total bilirubin, White Blood Cell Count

3. Vital Signs: Diastolic Blood Pressure, Systolic Blood Pressure, Mean Blood Pressure, PaCO2, PaO2, FiO2, Respiratory Rate, Temperature (Celsius), Weight (kg), Heart Rate, SpO2

4. Intake and Output Events: Fluid Output (4-hourly period), Total Fluid Output, Mechanical Ventilation, IV Fluids

A.3 Discussion on TRIL

We noticed a significant advantage from having the transition-based regularization (TRIL). As can be seen in Table 2 for sepsis, TRIL outperformed the unregularized baseline. In our sepsis experiment, obtaining an initial policy from TRIL was necessary for DSFN to perform well; DSFN without TRIL did not converge. We think that for a task as complex as sepsis management, it is essential to warm-start DSFN with TRIL. We again see the importance of a regularization scheme that learns the transition dynamics: it outperforms the unregularized version even though both use the same neural net architecture and training parameters.

For the experiment, we used the same imitation network across all comparisons. While not an IRL approach, it provides a comparison of how well the agent could do if it did not wish to recover a reward function. We found keeping the policy stochastic to be crucial for this task, in line with the multivariate Gaussian scheme described in the main paper. We conjecture this is because sepsis is a complicated disease to manage and, even today, there is no strong agreement on the optimal dosage within the clinician community; hence learning uncertainty estimates is useful. Another source of prediction errors could be the way we discretized our action space, which might not exactly reflect the buckets of vasopressor dosages that clinicians typically work with while treating patients.

Appendix B OpenAI Control Benchmarks

Here we share the details for the OpenAI control experiments.

B.1 Alternate Feature Engineering

For LSTD-$\mu$ and SCIRL, we also tried other variants of feature engineering, obtaining basis features using the means and standard deviations of states sampled uniformly from the environment. The performance results obtained for the baselines were in the same range as those tabulated in the main paper, and hence we do not repeat them here. For MountainCar-v0, we used a Gaussian kernel of 25 components for $\phi(s)$ and then one-hot encoded $\phi(s)$ over the 3 actions to represent $\phi(s,a)$, giving a dimension of 75. For Acrobot-v1 and Cartpole-v0, we used an RBF kernel of 100 components (25 components for each of $\gamma=0.1, 0.5, 1.0, 5.0$).
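The MountainCar-v0 construction can be illustrated with the sketch below, which builds a 25-component Gaussian feature vector $\phi(s)$ and places it in the block belonging to the chosen action to obtain a 75-dimensional $\phi(s,a)$. The randomly placed centers, the kernel width `kernel_gamma`, and the function names are assumptions made for illustration; the exact kernel construction used in the experiments is not specified here.

```python
import numpy as np

def rbf_features(state, centers, kernel_gamma=1.0):
    """Gaussian (RBF) features: exp(-kernel_gamma * ||s - c_i||^2) per center."""
    state = np.asarray(state, dtype=float)
    sq_dist = np.sum((centers - state) ** 2, axis=1)
    return np.exp(-kernel_gamma * sq_dist)

def state_action_features(state, action, centers, n_actions, kernel_gamma=1.0):
    """Copy phi(s) into the block of the chosen action; other blocks stay zero."""
    phi_s = rbf_features(state, centers, kernel_gamma)
    phi_sa = np.zeros(n_actions * len(phi_s))
    start = action * len(phi_s)
    phi_sa[start:start + len(phi_s)] = phi_s
    return phi_sa

rng = np.random.default_rng(0)
# MountainCar-v0 observations: position in [-1.2, 0.6], velocity in [-0.07, 0.07].
# Assumption: 25 centers drawn uniformly from the observation space.
centers = rng.uniform(low=[-1.2, -0.07], high=[0.6, 0.07], size=(25, 2))

phi = state_action_features([-0.5, 0.0], action=2, centers=centers, n_actions=3)
```

Because each action owns a separate block of the vector, a linear reward or value function over $\phi(s,a)$ effectively learns a separate weight vector per action.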

B.2 Experimental Details

We set a maximum of 10 iterations with two additional stopping conditions: the first is when the feature expectation margin reaches 0.1, and the second is when the difference in validation accuracy for action prediction between two consecutive iterations drops below 5%. We found the latter stopping condition useful for keeping the training loop stable. Unlike typical inverse reinforcement learning routines, there is no correcting mechanism based on ground-truth information (typically achieved by on-policy evaluation), and hence the training loop may diverge in the fully batch apprenticeship learning setting.
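The stopping rule can be written as a small check evaluated at the end of each IRL iteration. The sketch below rests on two interpretive assumptions: that the margin condition means the margin has fallen to 0.1 or below, and that the accuracy condition refers to an absolute change of less than 5 percentage points. The function name `should_stop` and its parameters are hypothetical.

```python
def should_stop(iteration, margin, val_acc_history,
                max_iters=10, margin_tol=0.1, acc_tol=0.05):
    """Return True if the IRL loop should terminate after this iteration.

    iteration: number of completed iterations
    margin: current feature expectation margin
    val_acc_history: validation action-prediction accuracies, one per iteration
    """
    if iteration >= max_iters:
        return True
    # Assumed reading: stop once the margin has shrunk to the tolerance.
    if margin <= margin_tol:
        return True
    # Assumed reading: stop once accuracy changes by less than acc_tol.
    if len(val_acc_history) >= 2:
        change = abs(val_acc_history[-1] - val_acc_history[-2])
        if change < acc_tol:
            return True
    return False
```

For example, `should_stop(3, 0.5, [0.70, 0.72])` returns `True` because the accuracy changed by only 0.02, whereas `should_stop(3, 0.5, [0.60, 0.70])` returns `False`.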

B.3 Neural Network Architectures

The details are given in Table 4.

Bibliography


1. Abbeel and Ng (2004). Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. page 1. ACM, 2004.
2. Barreto et al. (2017). André Barreto, Will Dabney, Rémi Munos, Jonathan J Hunt, Tom Schaul, Hado P van Hasselt, and David Silver. Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems, pages 4055–4065, 2017.
3. Duan et al. (2016). Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338, 2016.
4. Fu et al. (2017). Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. 2017.
5. Henderson et al. (2017). Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. 2017.
6. Herman et al. (2016). Michael Herman, Tobias Gindele, Jörg Wagner, Felix Schmitt, and Wolfram Burgard. Inverse reinforcement learning with simultaneous estimation of rewards and dynamics. In Artificial Intelligence and Statistics, pages 102–110, 2016.
7. Ho and Ermon (2016). Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.
8. Jin et al. (2015). Ming Jin, Andreas Damianou, Pieter Abbeel, and Costas Spanos. Inverse reinforcement learning via deep gaussian process. 2015.