Hierarchical Reinforcement Learning for Concurrent Discovery of Compound   and Composable Policies

Domingo Esteban; Leonel Rozo; Darwin G. Caldwell

arXiv:1905.09668·cs.RO·August 2, 2019

Hierarchical Reinforcement Learning for Concurrent Discovery of Compound and Composable Policies

Domingo Esteban, Leonel Rozo, Darwin G. Caldwell

PDF

1 Repo

TL;DR

This paper introduces a hierarchical reinforcement learning method that simultaneously learns compound and subtask policies using Gaussian policies and off-policy data, enabling concurrent subtask execution and efficient complex task learning.

Contribution

It presents a novel algorithm that combines compound and composable policies in a single off-policy RL framework with maximum entropy, facilitating concurrent subtask learning.

Findings

01

The method effectively solves complex tasks using experience from the compound policy.

02

It learns useful subtask policies that perform well independently.

03

Concurrent policy learning improves efficiency over sequential approaches.

Abstract

A common strategy to deal with the expensive reinforcement learning (RL) of complex tasks is to decompose them into a collection of subtasks that are usually simpler to learn as well as reusable for new problems. However, when a robot learns the policies for these subtasks, common approaches treat every policy learning process separately. Therefore, all these individual (composable) policies need to be learned before tackling the learning process of the complex task through policies composition. Moreover, such composition of individual policies is usually performed sequentially, which is not suitable for tasks that require to perform the subtasks concurrently. In this paper, we propose to combine a set of composable Gaussian policies corresponding to these subtasks using a set of activation vectors, resulting in a complex Gaussian policy that is a function of the means and covariances…

Tables1

Table 1. TABLE I : Environment-specific hyperparameters

	(1)	(2)	(3)	(4)
Units per layer	64	128	128	256
Training steps	$1.5 \cdot 10^{4}$	$1.5 \cdot 10^{5}$	$1.5 \cdot 10^{5}$	$1.5 \cdot 10^{6}$
Size $𝒟$	$5 \cdot 10^{6}$	$5 \cdot 10^{6}$	$5 \cdot 10^{6}$	$1 \cdot 10^{7}$
Size Minibatch	64	256	256	256
$\bar{ℋ}$	0	1	1	1

Equations18

c_{i} = k = 1 \sum K (w_{i}^{[k]} σ_{i}^{[k]})^{2}, m_{i} = k = 1 \sum K w_{i}^{[k]} m_{i}^{[k]},

c_{i} = k = 1 \sum K (w_{i}^{[k]} σ_{i}^{[k]})^{2}, m_{i} = k = 1 \sum K w_{i}^{[k]} m_{i}^{[k]},

c_{i} = (k = 1 \sum K \frac{w _{i}^{[k]}}{( σ _{i}^{[k]} ) ^{2}})^{- 1}, m_{i} = c_{i} (k = 1 \sum K \frac{w _{i}^{[k]}}{( σ _{i}^{[k]} ) ^{2}} m_{i}^{[k]}),

c_{i} = (k = 1 \sum K \frac{w _{i}^{[k]}}{( σ _{i}^{[k]} ) ^{2}})^{- 1}, m_{i} = c_{i} (k = 1 \sum K \frac{w _{i}^{[k]}}{( σ _{i}^{[k]} ) ^{2}} m_{i}^{[k]}),

J_{π} (θ) = J_{π} (θ^{w}; M) + k = 1 \sum K J_{π} (θ^{T}; T^{[k]}),

J_{π} (θ) = J_{π} (θ^{w}; M) + k = 1 \sum K J_{π} (θ^{T}; T^{[k]}),

J({{\pi}^{[j]}})=\sum_{t=0}^{\infty}{\mathbb{E}}_{{\pi}^{[j]}}\left[{\gamma}^{t}\Big{(}r^{[j]}({\mathbf{s}_{t}},{\mathbf{a}_{t}})+\alpha{\mathcal{H}}({\pi}^{[j]}(\cdot|{\mathbf{s}_{t}}))\Big{)}\right]

J({{\pi}^{[j]}})=\sum_{t=0}^{\infty}{\mathbb{E}}_{{\pi}^{[j]}}\left[{\gamma}^{t}\Big{(}r^{[j]}({\mathbf{s}_{t}},{\mathbf{a}_{t}})+\alpha{\mathcal{H}}({\pi}^{[j]}(\cdot|{\mathbf{s}_{t}}))\Big{)}\right]

J_{Q} (ϕ) = E_{(s_{t}, a_{t}) \sim D} [\frac{1}{2} j \sum (Q_{ϕ}^{[j]} (s_{t}, a_{t}) - (r^{[j]} (s_{t}, a_{t}) + γ E_{s_{t + 1} \sim p_{s}} [V_{\overset{ˉ}{ϕ}}^{[j]} (s_{t + 1})]))^{2}]

J_{Q} (ϕ) = E_{(s_{t}, a_{t}) \sim D} [\frac{1}{2} j \sum (Q_{ϕ}^{[j]} (s_{t}, a_{t}) - (r^{[j]} (s_{t}, a_{t}) + γ E_{s_{t + 1} \sim p_{s}} [V_{\overset{ˉ}{ϕ}}^{[j]} (s_{t + 1})]))^{2}]

V_{\overset{ˉ}{ϕ}}^{[j]} (s_{t}) = E_{a_{t} \sim π_{θ}^{[j]}} [Q_{\overset{ˉ}{ϕ}}^{[j]} (s_{t}, a_{t}) - α lo g (π_{θ}^{[j]} (a_{t} ∣ s_{t}))]

V_{\overset{ˉ}{ϕ}}^{[j]} (s_{t}) = E_{a_{t} \sim π_{θ}^{[j]}} [Q_{\overset{ˉ}{ϕ}}^{[j]} (s_{t}, a_{t}) - α lo g (π_{θ}^{[j]} (a_{t} ∣ s_{t}))]

J_{π} (θ^{T}; T^{[k]}) = E_{s_{t} \sim D} [E_{a_{t} \sim π_{θ}^{[k]}} [Q_{ϕ}^{[k]} (s_{t}, a_{t}) - α lo g (π_{θ}^{[k]} (a_{t} ∣ s_{t}))]]

J_{π} (θ^{T}; T^{[k]}) = E_{s_{t} \sim D} [E_{a_{t} \sim π_{θ}^{[k]}} [Q_{ϕ}^{[k]} (s_{t}, a_{t}) - α lo g (π_{θ}^{[k]} (a_{t} ∣ s_{t}))]]

J_{π} (θ^{w}; M) = E_{s_{t} \sim D} [E_{a_{t} \sim π_{θ}^{M}} [Q_{ϕ}^{M} (s_{t}, a_{t}) - α lo g (π_{θ}^{M} (a_{t} ∣ s_{t}))]]

J_{π} (θ^{w}; M) = E_{s_{t} \sim D} [E_{a_{t} \sim π_{θ}^{M}} [Q_{ϕ}^{M} (s_{t}, a_{t}) - α lo g (π_{θ}^{M} (a_{t} ∣ s_{t}))]]

J (α^{[j]}) = E_{a_{t} \sim π^{[j]}} [- α lo g (π_{θ}^{[j]} (a_{t} ∣ s_{t})) + α \overset{ˉ}{H}^{[j]}]

J (α^{[j]}) = E_{a_{t} \sim π^{[j]}} [- α lo g (π_{θ}^{[j]} (a_{t} ∣ s_{t})) + α \overset{ˉ}{H}^{[j]}]

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

domingoesteban/hiu_sac
pytorch

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Hierarchical Reinforcement Learning for Concurrent Discovery of Compound and Composable Policies

Domingo Esteban1,2, Leonel Rozo3 and Darwin G. Caldwell1 1Department of Advanced Robotics, Istituto Italiano di Tecnologia, Via Morego 30, 16163 Genova, Italy [email protected] and [email protected]2DIBRIS, Università di Genova, Via Opera Pia 13, 16145, Italy3Bosch Center for Artificial Intelligence, Robert-Bosch-Campus 1, 71272 Renningen, Germany. [email protected]

Abstract

A common strategy to deal with the expensive reinforcement learning (RL) of complex tasks is to decompose them into a collection of subtasks that are usually simpler to learn as well as reusable for new problems. However, when a robot learns the policies for these subtasks, common approaches treat every policy learning process separately. Therefore, all these individual (composable) policies need to be learned before tackling the learning process of the complex task through policies composition. Moreover, such composition of individual policies is usually performed sequentially, which is not suitable for tasks that require to perform the subtasks concurrently. In this paper, we propose to combine a set of composable Gaussian policies corresponding to these subtasks using a set of activation vectors, resulting in a complex Gaussian policy that is a function of the means and covariances matrices of the composable policies. Moreover, we propose an algorithm for learning both compound and composable policies within the same learning process by exploiting the off-policy data generated from the compound policy. The algorithm is built on a maximum entropy RL approach to favor exploration during the learning process. The results of the experiments show that the experience collected with the compound policy permits not only to solve the complex task but also to obtain useful composable policies that successfully perform in their corresponding subtasks.

I Introduction

Reinforcement learning (RL) is a general framework that allows an agent to autonomously discover optimal behaviors by interacting with its environment. However, when applied to robotics scenarios some specific challenges arise, such as high-dimensional continuous state-action spaces, real-time requirements, delays and noise in sensing and execution, and expensive (real-world) samples [1]. Recently, several authors have overcome some of these challenges by using deep neural networks (NN) as parameterized policies for generating rich behaviors in end-to-end frameworks [2][3]. Nevertheless, learning the high number of parameters of deep NNs in complex tasks involves a large number of interactions with the environment that compromises the real-world sample challenge.

Several algorithms have been proposed to improve sample efficiency of model-free deep RL by making a better use of the sample information (data-efficiency), obtaining more information from data (sample choice) and improving several times the policy with the same samples (sample reuse) [4]. However, learning a complex robotic task may still be slow or even infeasible when the robot learns from scratch. An appealing approach to deal with this problem is to decompose the task into subtasks that are both simpler to learn and reusable for new problems.

Many tasks in robotics may intuitively be divided into individual tasks, for example, the task of moving an object to a specific location may be decomposed into reaching for the object, grasping it, moving it to the target, and releasing it. Therefore, when a robot is provided with a collection of policies defined in these composable tasks, a new RL problem can be stated as learning how to mix these composable policies such that the performance criterion of the complex task is optimized. Note that shifting the complexity of this task learning problem to layers of simpler functionality is mainly studied in the field of hierarchical RL. However, the assumptions of common hierarchical RL algorithms limit their application in several robotic tasks.

Firstly, several hierarchical RL methods decompose the complex RL problem temporally, meaning that during a certain period of time the behavior of the robot is delegated to a specific subtask policy [5]. This temporal decomposition is suitable for robotic tasks that can be divided sequentially, however, many others require the robot to perform individual tasks concurrently. For example, a manipulator carrying an object and avoiding an obstacle at the same time. Secondly, common learning methods optimize the individual policies and the compound policy through independent single-task RL processes [6][7][8]. Therefore, the robot has to continuously interact with the environment to learn, possibly from scratch, first a collection of composable policies, and only after that their composition, compromising sample efficiency.

Under this scenario, we propose a two-level hierarchical RL approach where a set of Gaussian policies, constituting the low level of the hierarchy, are composed at the high level by means of state-dependent activation vectors defined for each policy. These activation vectors allow to consider concurrently actions sampled from all the low-level policies and preferences among specific components. Furthermore, we propose in section IV two alternatives for obtaining a compound Gaussian policy as a function of the parameters of the low-level policies and their corresponding activation vectors. An algorithm for learning both high- and low-level policies within the same learning process is proposed and detailed in section V.

As an illustration, Figure 1 shows how the proposed algorithm executes actions sampled from the high-level policy, and then exploits this experience for learning simultaneously both low-level policies and activation vectors in an off-policy manner. The results of the experiments detailed in section VI show that the proposed algorithm obtains a high-level policy that solves compound tasks while learning useful low-level policies that can be potentially reused for new tasks. Supplementary videos and code of the proposed approach are available at: https://sites.google.com/view/hrl-concurrent-discovery

II Related Work

Complex problems in RL usually involve either tasks that can be hierarchically organized into subtasks, scenarios requiring concurrent execution of several tasks, and tasks with large or continuous state-action spaces [9]. Hierarchical RL approaches split a complex task into a set of simpler elementary tasks [10]. Some of these methods have been successfully applied in robotics by exploiting temporal abstraction, where the decision to invoke a particular task is not required at each time step but over different time periods [5][11]. As a consequence, these methods assume that a high-level policy, which selects the subtask, and low-level policies, which select the action, are executed in different time scales. In contrast, the approach proposed in this paper considers that the decisions at both levels of the hierarchy are executed at each time step.

The temporal abstraction assumption in most hierarchical RL methods also involves that during a certain period of time the robot only performs a particular task. RL problems requiring policies that solve several tasks at the same time are commonly stated as multiobjective or modular RL problems [12][13]. The policies of all these subtasks may be combined using weights describing the predictability of the environmental dynamics [14], or the values obtained from the desirability function in a linearly-solvable control context [15]. Another alternative is to combine action-value functions of composable tasks, and then extract a policy from this combined function [8]. The latter paper is similar to ours since the composable policies are also optimizing an entropy-augmented RL objective, however, their combination is carried out at the level of action-value functions unlike our policy-based approach. Moreover, their composable policies have been previously learned in independent processes, in comparison to the algorithm proposed here where both compound and composable policies are improved in the same RL process.

Exploiting the experience collected using a particular policy (so-called behavior policy) to concurrently improve several policies is a promising strategy that has been previously explored in literature. The Horde architecture shows how independent RL agents with same state-action spaces can be formulated for solving different problems [16]. This approach proposed by Sutton et al. is relevant because exploits its off-policy formulation for improving all the policies in parallel.

Several works extended the Horde formulation and used deep NN for dealing with high dimensional or continuous state-action spaces, and also to provide a modularity that can be exploited for modeling and training independent RL problems [17][18][19][20]. The Intentional-Unintentional agent [17], for example, shows how deep NN policies and value functions can be learned even with an arbitrarily selected behavior policy. However, this policy can also be chosen at each time step by following a hierarchical objective, and the selection process can be improved as a function of the performance in the complex task [20]. Note that the policies in [17][20] are deterministic, and then the exploration should be carried out by adding a noise generated from an Ornstein-Uhlenbeck process. In contrast, all the policies in this paper are stochastic and the exploration is generated directly from the behavior policy, that is the stochastic high-level policy.

Most of the notation and the approach proposed in this paper are inspired by the Scheduled Auxiliary Control (SAC-X) method [21]. SAC-X solves complex tasks based on a collection of simpler individual tasks, and learns from scratch, both high- and low-level policies simultaneously. However, this method considers temporal abstraction in the hierarchy, and therefore the high-level policy is a scheduler that occasionally selects one low-level policy. Therefore, the policies at the low level of the hierarchy can only be executed sequentially and run at a different time-scale than the policy defined in the high level. This methodology differs from our approach as we consider a framework that is able to execute different policies concurrently.

III Preliminaries

The sequential decision making process of a robot could be modeled by a Markov decision process (MDP) ${\mathcal{M}}$ , defined by the tuple $(\mathcal{S},\mathcal{A},{p_{s}},r)$ , where $\mathcal{S}\subset\mathbb{R}^{{\text{D}_{\mathcal{S}}}}$ and $\mathcal{A}\subset\mathbb{R}^{{\text{D}_{\mathcal{A}}}}$ are continuous state and action spaces of dimensionality ${{\text{D}_{\mathcal{S}}}}$ and ${{\text{D}_{\mathcal{A}}}}$ , respectively. At each time step $t$ , the robot selects an action ${\mathbf{a}_{t}}\in\mathcal{A}$ according to a policy ${\pi}$ which is a function of the current state of the environment ${{\mathbf{s}_{t}}\in\mathcal{S}}$ . After this interaction, the state of the environment changes to ${\mathbf{s}_{t+1}}\in\mathcal{S}$ with a probability density ${p_{s}}=p({\mathbf{s}_{t+1}}|{\mathbf{s}_{t}},{\mathbf{a}_{t}})$ , and the robot receives a reward according to the function ${r:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}}$ . The robot’s goal is to maximize the expected infinite-horizon discounted accumulated reward ${G}({\tau})$ of the trajectory ${{\tau}=\{{\mathbf{s}}_{0},{\mathbf{a}}_{0},{\mathbf{s}}_{1},{\mathbf{a}}_{1},\dots\}}$ , induced by its policy ${\pi}$ . That is to say ${J({\pi})={\mathbb{E}}_{{\tau}}\;[{G}({\tau})]={\mathbb{E}}_{{\tau}}\;[\sum_{t=0}^{\infty}{\gamma}^{t}\;r({\mathbf{s}_{t}},{\mathbf{a}_{t}})]}$ .

III-A Maximum Entropy Reinforcement Learning

The exploration required for a robot to generate behaviors that produce high return can be directly obtained by the stochastic policy ${\pi}({\mathbf{a}_{t}}|{\mathbf{s}_{t}})$ , where the randomness of the actions ${\mathbf{a}_{t}}$ given the state ${\mathbf{s}_{t}}$ can be quantified by the entropy of the policy. Maximum entropy reinforcement learning or entropy regularized RL [22][23] is a formulation that augments the previous RL objective by including the entropy of the robot’s policy ${J({\pi})={\mathbb{E}}_{{\tau}}\left[\sum_{t=0}^{\infty}{\gamma}^{t}\Big{(}r({\mathbf{s}_{t}},{\mathbf{a}_{t}})+\alpha{\mathcal{H}}({\pi}(\cdot|{\mathbf{s}_{t}}))\Big{)}\right]}$ , where ${\mathcal{H}}({\pi}(\cdot|{\mathbf{s}_{t}}))$ denotes the entropy of an action ${\mathbf{a}_{t}}$ with distribution ${\pi}({\mathbf{a}_{t}}|{\mathbf{s}_{t}})$ and is computed as ${{\mathcal{H}}({\pi}(\cdot|{\mathbf{s}_{t}}))={\mathbb{E}}_{{\mathbf{a}_{t}}\sim{\pi}}[-\log{\pi}(\cdot|{\mathbf{s}_{t}})]}$ . The parameter $\alpha$ controls the stochasticity of the optimal policy. Note that the conventional RL objective is recovered in the limit $\alpha\rightarrow 0$ .

III-B Soft Actor-Critic algorithm

Soft actor-critic (SAC) is an off-policy actor-critic deep RL algorithm that optimizes stochastic policies defined in the maximum entropy framework [24]. The algorithm is built on a policy iteration formulation that alternates between policy evaluation and policy improvement steps. In the former, a parameterized soft Q-function ${Q_{{\bm{\phi}}}}$ is updated to match the value of the parameterized policy ${{\pi}_{\bm{\theta}}}$ according to the maximum entropy objective, while in the latter the policy ${{\pi}_{\bm{\theta}}}$ is updated towards the exponential of the updated ${Q_{{\bm{\phi}}}}$ . The update rules for the NN parameters ${\bm{\phi}}$ and ${\bm{\theta}}$ are given in section V, and more details of the SAC algorithm can be found in [24][25].

IV Composition of modular Gaussian policies

Acquiring autonomously a robotic skill for a specific task can be achieved by directly optimizing the maximum entropy RL objective with the experience collected from the task execution. Nevertheless, when a new task defined in the same state and action spaces has to be learned, the robot should interact again with the environment to obtain useful data for improving the policy for this new task. Sample efficiency is a key concern in robotics, therefore in this section we propose a two-level hierarchical model that seeks to construct a set of simpler and reusable policies in the low level of the hierarchy, and a high-level policy that combines them for solving a more complex task. In addition, in section V, we will introduce an algorithm for exploiting better the interaction data and learning both low- and high-level policies all together.

IV-A Hierarchical model for composing modular policies

First, let us assume that several complex tasks can be decomposed into a set of $K$ composable tasks ${{\mathcal{T}}=\{{\mathcal{T}}^{[k]}\}_{1}^{K}}$ . All of them have the same state space, action space and transition dynamics, however, each one is characterized by a specific reward function $r^{[k]}({\mathbf{s}_{t}},{\mathbf{a}_{t}})$ . Thus, the corresponding MDP for each composable task ${\mathcal{T}}^{[k]}$ is ${(\mathcal{S},\mathcal{A},{p_{s}},r^{[k]})}$ . Second, let us assume that stochastic policies defined in these MDPs, called composable policies, are conditional Gaussian distributions ${{{\pi}^{[k]}}({\mathbf{a}}|{\mathbf{s}})={\mathcal{N}}(\mathbf{m}^{[k]},\mathbf{C}^{[k]})}$ , with mean vector $\mathbf{m}^{[k]}\in\mathcal{A}$ and diagonal covariance matrix ${\mathbf{C}^{[k]}=\text{diag}[\sigma^{2}_{1},\sigma^{2}_{2},\dots,\sigma^{2}_{{\text{D}_{\mathcal{A}}}}]}$ .

Let us also define a (possibly more complex) compound task ${\mathcal{M}}$ , described by the combination of the tasks in ${\mathcal{T}}$ . As with the composable tasks, ${\mathcal{M}}$ also shares the same state space, action space and transition dynamics, but it is characterized by the reward $r^{{\mathcal{M}}}({\mathbf{s}_{t}},{\mathbf{a}_{t}})$ . Finally, a stochastic policy defined in the resulting MDP ${(\mathcal{S},\mathcal{A},{p_{s}},r^{{\mathcal{M}}})}$ is named compound policy ${{\pi}^{{\mathcal{M}}}}$ . This policy reuses the set of composable policies ${{\Pi}=\{{{\pi}^{[k]}}\}_{1}^{K}}$ defined in ${\mathcal{T}}$ , by using the set ${\mathcal{W}}=\{\mathbf{w}^{[k]}\}_{1}^{K}$ where $\mathbf{w}^{[k]}\in\mathbb{R}^{{\text{D}_{\mathcal{A}}}}$ is an activation vector whose components are used to combine each DoF of the action vector. Therefore, the compound policy is modeled as a two-level hierarchical policy ${{{\pi}^{{\mathcal{M}}}}={\mathtt{f}}({\mathcal{W}},{\Pi})}$ .

The generation process of an action from the compound policy involves first obtaining the actions from ${\Pi}$ in the low level of the hierarchy, and then combining them at the high level of the hierarchy. Instead, we here exploit the assumptions made for the composable policies and propose two formulations of ${\mathtt{f}}$ for obtaining a policy ${{\pi}^{{\mathcal{M}}}}({\mathbf{a}}|{\mathbf{s}})$ that is also conditional Gaussian and defined in terms of the means $\mathbf{m}^{[k]}$ and covariances matrices $\mathbf{C}^{[k]}$ of the composable policies. As a result, an action from the compound policy can be sampled directly from the resulting conditional distribution, ${\mathbf{a}}\sim{{\pi}^{{\mathcal{M}}}}({\mathbf{a}}|{\mathbf{s}})$ .

The first option for ${\mathtt{f}}$ is to consider that the action of the compound policy is the convex combination of elements of actions sampled from the composable policies. As the actions for each composable policy are conditionally independent given the states, and each ${{\pi}^{[k]}}({\mathbf{a}}|{\mathbf{s}})$ is conditional Gaussian, the resulting action is also normally distributed. Therefore, the compound policy is ${{{\pi}^{{\mathcal{M}}}}({\mathbf{a}}|{\mathbf{s}})={\mathcal{N}}(\mathbf{m},\text{diag}(\mathbf{c}))}$ , where the components in $\mathbf{m}$ and $\mathbf{c}$ are computed as

[TABLE]

for all $1\leq i\leq{{\text{D}_{\mathcal{A}}}}$ , with $w_{i}^{[k]}$ , $m_{i}^{[k]}$ , $\sigma_{i}^{[k]}$ as the corresponding elements of the activation vector, mean vector, and standard deviation vector for the composable policy ${\pi}^{[k]}$ .

The second alternative for modeling ${\mathtt{f}}$ is to consider that each component $i$ in the resulting action vector is obtained from a product of conditional Gaussians ${{{\pi}^{{\mathcal{M}}}}(a_{i}|{\mathbf{s}})\propto\prod_{k=1}^{K}({{\pi}^{[k]}}(a_{i}|{\mathbf{s}}))^{w_{i}^{[k]}}}$ . As a result, the compound policy is also ${{{\pi}^{{\mathcal{M}}}}({\mathbf{a}}|{\mathbf{s}})={\mathcal{N}}(\mathbf{m},\text{diag}(\mathbf{c}))}$ where the components in $\mathbf{m}$ and $\mathbf{c}$ are computed as

[TABLE]

for all $1\leq i\leq{{\text{D}_{\mathcal{A}}}}$ , with $w_{i}^{[k]}$ , $m_{i}^{[k]}$ , $\sigma_{i}^{[k]}$ defined as in the first case.

IV-B Hierarchical Policy and Q-functions Modeling

The composable tasks in $\mathcal{T}$ are formulated as independent RL problems, and thus the corresponding policies are also independent. In this line, the mean $\mathbf{m}^{[k]}$ and covariance matrix $\mathbf{C}^{[k]}$ 111More specifically a vector of log standard deviations. that describe a composable Gaussian policy can be obtained from an independent NN. Nevertheless, we can exploit the assumption that all the tasks in ${\mathcal{T}}$ share the same state space, and therefore use shared layers across all the policies for obtaining features that can be exploited by all the policies. The parameters of the resulting NN policy are denoted by ${\bm{\theta}}^{{\mathcal{T}}}$ , and include both the parameters of each NN policy and the shared layers.

Moreover, the function ${\mathtt{h}}$ required for obtaining the state-dependent activation vectors $\{\mathbf{w}^{[k]}\}_{1}^{K}={\mathtt{h}}(s)$ , can also be modeled by an NN with parameters ${\bm{\theta}}^{w}$ . This new NN is included in the previously described NN and thus also exploits the features obtained in the shared layers. Therefore, the whole hierarchical policy is parameterized by ${{\pi}_{\bm{\theta}}}$ , an NN with parameters ${\bm{\theta}}=[{\bm{\theta}}^{{\mathcal{T}}}\quad{\bm{\theta}}^{w}]$ and depicted in Fig. 2(a).

In the same way, an NN with an architecture similar to the hierarchical policy can be used to model the Q-functions of both the composable policies and compound policy. Therefore, the resulting NN, denoted by ${Q_{{\bm{\phi}}}}$ and depicted in Fig. 2(b), has parameters ${\bm{\phi}}=[{\bm{\phi}}^{{\mathcal{T}}}\quad{\bm{\phi}}^{{\mathcal{M}}}]$ .

V Simultaneous Learning and Composition of Modular Maximum Entropy Policies

Most methods learn the composable tasks one at a time, and later, the compound task. This procedure is not scalable as all the experience collected for each learning process is only used for that specific process. Also, it is not possible to start learning more complex tasks unless all the composable policies have been successfully learned. The method proposed in this section is based on the idea that a single stream of experience can be used to improve not only the policy that is generating the behavior but also, indirectly, many other policies. Similar to [17] and [21], our method assumes that the robot receives, at each time step, the rewards for different tasks, and each reward has an assigned policy that tries to maximize its corresponding return $G$ by using the same collection of state-action pairs.

V-A Off-Policy Multi-task Policy Search

The sets of composable policies ${\Pi}$ and activation vectors ${\mathcal{W}}$ required for a compound policy ${{\pi}^{{\mathcal{M}}}}$ to solve task ${\mathcal{M}}$ , can be learned simultaneously. To do so, let us assume that, at each time step, the robot receives a stream of rewards ${\mathbf{r}_{t}=[r_{t}^{[1]}\;\dots\;r_{t}^{[K]}\;r_{t}^{\mathcal{M}}]^{\mathsf{T}}}$ , that is, a vector whose components are the reward $r_{t}^{\mathcal{M}}$ of the compound task ${\mathcal{M}}$ and the reward $r_{t}^{[k]}$ of each composable task in ${\mathcal{T}}$ .

Moreover, the method considers that the behavior or intentional policy, i.e. the policy followed by the robot to interact with the environment, is always the compound policy ${{\pi}^{{\mathcal{M}}}}$ . The experience at each time step $({\mathbf{s}_{t}},{\mathbf{a}_{t}},\mathbf{r}_{t},{\mathbf{s}_{t+1}})$ is collected in the dataset $\mathcal{D}$ , and subsequently used for improving compound and composable policies. As a result, the data in $\mathcal{D}$ is off-policy experience for the composable policies because it is generated from a different policy [26]. Thus, the composable policies (from now on unintentional policies [17]) are target policies in the off-policy setting.

Therefore, by considering the parameterized policy ${{\pi}_{\bm{\theta}}}$ (see section IV-B) with parameters ${{\bm{\theta}}=[{\bm{\theta}}^{{\mathcal{T}}}\quad{\bm{\theta}}^{w}]}$ , the optimal parameters ${\bm{\theta}}^{*}$ are those that maximize the objective

[TABLE]

where $J_{\pi}({\bm{\theta}}^{{\mathcal{T}}};{\mathcal{T}}^{[k]})$ denotes the performance criterion of the composable policy ${\pi}_{{\bm{\theta}}}^{[k]}$ in task $\mathcal{T}^{[k]}$ , and $J_{\pi}({\bm{\theta}}^{w};{\mathcal{M}})$ the performance criterion of the compound policy ${\pi}_{{\bm{\theta}}}^{{\mathcal{M}}}$ in task ${\mathcal{M}}$ .

V-B Multi-task Soft Actor-Critic

As mentioned in section III-A, the maximum entropy objective incentives exploration, which is critical for the introduced method as the composable policies are learned unintentionally and their influence in the sampling process is indirect. Thus, each policy seeks to optimize the maximum entropy objective

[TABLE]

where $r^{[j]}$ is the reward function of the corresponding task $j\in({\mathcal{T}}\cup\{{\mathcal{M}}\})$ .

Considering the SAC algorithm described in section III-B, the learning process for all the aforementioned policies is an alternating procedure of policy evaluation, where the value function is computed for all the policies, and policy update, where the policies are improved with their corresponding value functions. Therefore, at each time step, the parameterized Q-function ${Q_{{\bm{\phi}}}}$ optimizes the soft mean-squared bellman error of all the policies

[TABLE]

for all the tasks $j\in({\mathcal{T}}\cup\{{\mathcal{M}}\})$ , where the value function $V^{[j]}_{\bar{{\bm{\phi}}}}$ is implicitly parameterized through a target soft Q-function with parameters $\bar{{\bm{\phi}}}$ via

[TABLE]

On the other hand, the components of (3) for the policy improvement step of the parameterized policy ${{\pi}_{\bm{\theta}}}$ are defined as

[TABLE]

for each composable task ${\mathcal{T}}^{[k]}$ in ${\mathcal{T}}$ , and

[TABLE]

for the compound task ${\mathcal{M}}$ . Note that the parameters ${\bm{\theta}}^{w}$ required for modeling the activation vectors in ${\mathcal{W}}$ are the only ones updated in the compound policy because, as discussed in [21], there is no guarantee to preserve the unintentional policies. As a consequence, the proposed algorithm improves the parameterized hierarchical policy ${{\pi}_{\bm{\theta}}}$ in a two-step process with random minibatches from ${\mathcal{D}}$ . First, by optimizing ${\bm{\theta}}^{\mathcal{T}}$ with (7) for all the composable tasks. And second, by fixing ${\bm{\theta}}^{\mathcal{T}}$ and optimizing ${\bm{\theta}}^{\mathcal{M}}$ through (8).

As suggested in [27], the practical algorithm considers two soft Q-function NNs with parameters ${\bm{\phi}}_{i}$ trained independently to optimize (5). Furthermore, the algorithm includes a step to calculate $\alpha$ automatically by optimizing

[TABLE]

for $j\in({\mathcal{T}}\cup\{{\mathcal{M}}\})$ . Therefore, the whole training process with the proposed method, called Hierarchical Intentional-Unintentional (HIU), is summarized in Algorithm 1.

VI Experiments

In order to analyze the proposed approach, several experiments were carried out in four robotic tasks that can be intuitively decomposed into simpler tasks. The goal of these experiments is to evaluate if the proposed approach 1) solves an RL problem with a policy that reuses a set of composable policies, and at the same time, 2) obtains composable policies with performance similar to dedicated single-task policies.

VI-A Tasks Description

In the first environment, shown in Fig. 3(a), the agent is a 2D point particle that has to reach the position $(-2,-2)$ . The state of this environment is continuous and defined by the position $(x,y)$ of the particle, and the control actions are its velocities $(\dot{x},\dot{y})$ , then ${{\text{D}_{\mathcal{S}}}}={{\text{D}_{\mathcal{A}}}}=2$ . The initial position of the particle is sampled from a spherical Gaussian distribution centered in the position $(4,4)$ . This task can be decomposed into two composable tasks, namely, reaching the position $-2$ in the $x$ coordinate, and reaching the position $-2$ in the $y$ coordinate. Therefore, the compound policy to reach $(-2,-2)$ has to combine the corresponding composable policies.

The second and third environments correspond a 3-DoF planar manipulator simulated in Pybullet [28], and whose control actions are joint torques, then ${{\text{D}_{\mathcal{A}}}}=3$ . The second environment, shown in 3(b), requires the robot to reach a random goal pose and is described by a state composed of the joint positions and velocities of the robot, and the relative position of the robot end-effector w.r.t the goal, then, ${{\text{D}_{\mathcal{S}}}}=8$ . The third environment, displayed in 3(c), requires the manipulator to reach a cylinder and push it to a target location. In addition to the joint positions and velocities, the state also includes the positions of the end-effector, the cylinder and the goal, then ${{\text{D}_{\mathcal{S}}}}=12$ .

Finally, the proposed approach is also tested in a more complex task, where a simulated CENTAURO robot [29] has to reach a target 3D pose while balancing an object with a tray, as depicted in 3(d). The tray is firmly attached to the robot hand but not the cylinder. The control actions are task joint torques of the right arm, then ${{\text{D}_{\mathcal{A}}}}=7$ . Note that if a zero-torque control action was applied to the joints, the cylinder would fall by its weight. The state in this scenario is composed of the arm joint positions and velocities, pose errors and rate of change of the errors between the target pose and the center of the tray, and between the desired pose of the cylinder in the tray and its current pose, then ${{\text{D}_{\mathcal{S}}}}=38$ .

VI-B Robot Learning Details

The NN models proposed in subsection IV-B are used for learning the four tasks above described. The same architecture is used in all the experiments, namely, NNs with ReLU nonlinearities in the hidden nodes and none in the outputs. However, the number of nodes depends on the task, as summarized in Table I.

The tasks are learned with the algorithm proposed in section V and the following hyperparameters: Adam optimizer with learning rates $\lambda_{Q}=\lambda_{{\pi}}=\lambda_{\alpha}=3\cdot 10^{-4}$ , target smoothing coefficient $\rho=5\cdot 10^{-3}$ , and discount factor ${\gamma}=0.99$ . The NNs are trained using stochastic gradient descent with batches sampled from ${\mathcal{D}}$ after an interaction with the environment. The entropy target $\bar{{\mathcal{H}}}^{[j]}$ is the same for both composable and compound policies, however its value varies for each task. These values and the replay buffer size ${\mathcal{D}}$ for each environment are also depicted in Table I.

The two composition strategies described in section IV are denoted with HIUSAC. The alternative that considers Equation (missing) 1 is denoted by HIUSAC-1, while the solution using Equation (missing) 2 is denoted by HIUSAC-2. For comparison purposes, the SAC algorithm was used to learn both compound and composable policies in single-task RL formulations.

VI-C Results

4(a) shows the learning curves of the composable policies obtained with both HIUSAC-1 and HIUSAC-2 for the 2D particle environment. The achieved performance is similar to that obtained directly in the compound task with the SAC algorithm. The approximated soft Q-values for the velocities (actions) given some specific positions of the particle (state) are depicted in 4(b). Notice how the actions and soft Q-values vary as a function of the position, capturing the specifications of their respective tasks. This is a remarkable result as the composable policies were learned unintentionally with off-policy experience collected only with the compound policy.

The learning curves of the other three environments are displayed in Figure 6. As noted previously, the tasks with the planar manipulator are more complex than the navigation of the 2D particle because the action space, i.e. joint torques, influence directly in the task of reaching the $x$ position and the task of reaching the $y$ position. Therefore, it is more difficult to assign proper activation vectors for the respective composable policies. However, both HIUSAC-1 and HIUSAC-2 obtained successful composable and compound policies, all of them with a performance similar to the policies obtained with the SAC algorithm in single-task RL formulations. However, as we can notice in the composable task 2 for the reaching environment (6(a)), both alternatives require more iterations to converge.

Finally, the results obtained for the task carried out by the CENTAURO robot are reported in 6(c). In this case, the composable policy converges faster and results in higher average returns when compared to the policy obtained in the single-task formulation. This results demonstrate how the complexity of one task can be solved with a collection of simpler subtasks by exploiting a hierarchical off-policy formulation. An interesting result is that the performance of the composable policies for task 2 exceeds that of their single-task counterparts. We attribute this to the fact that the compound policy explores better the environment and therefore the collected experience contains more meaningful information than those obtained in single-task RL formulations. Between HIUSAC-1 and HIUSAC-2, the latter converges faster and results in higher average returns in the compound task and also the composable ones.

VII Conclusions and Future Work

In this paper we have proposed a hierarchical RL approach for tasks that can be decomposed into a collection of subtasks that require to be performed concurrently. The Gaussian policies corresponding to these subtasks are combined using a set of activation vectors. These activation vectors allow to consider concurrently actions sampled from all the low-level policies and preferences among specific components. Furthermore, two methods were proposed to obtain a compound policy that is also Gaussian and a function of the means and covariances matrices of the composable policies.

Moreover, we proposed an algorithm for learning both compound and composable policies within the same learning process by exploiting the off-policy data generated from the compound policy. Note that populating the replay memory buffer with rich experiences is essential for acquiring multiple skills in an off-policy manner. The composable policies learned unintentionally had similar performance than the policies obtained in single-task formulations only when the compound policy was able to efficiently explore the environment. For this reason, the algorithm was built on a maximum entropy RL framework to favor exploration during the learning process.

Nevertheless, choosing the temperature parameters for the maximum entropy RL objective is challenging for the compound policy because its stochasticity is determined by the activation vectors and the stochasticity of all the composable policies. As a result, high temperature values favors higher entropy policies and the compound policy will show preference for composable policies with high stochasticity. However, this preference is made to the detriment of preferring policies with good performance but low entropy. We have used the automating entropy adjustment strategy proposed in [27] that reduces this problem, however now the problem is derived to choose the minimum expected entropy for both composable and compound policies. This value was easier to set for the simpler environments of the experiments, but more challenging for the complex ones. Therefore, future work could be focus in developing a mechanism that allows to obtain this value automatically based on the performance of both compound and composable policies.

On the other hand, in this paper, the action value functions of the composable policies are only used in the policy evaluation step of these policies. However, these models capture the performance of the policies in their respective tasks, and therefore they are important sources of information that could be considered by the compound policy.

Finally, an important motivation for task decomposition is reusing the composable policies in new tasks. Future work will be to analyze the behavior of the composable policies when they are reused in different compound tasks. It is important to know how the performance of new compound policies is affected by composable policies obtained in different contexts. Moreover, the experiments carried out in this paper were focused in tasks that are decomposed in two subtasks. Future work will consider tasks that could be decomposed in more than two subtasks and tasks where the components of the action vector can be assigned completely to specific tasks, e.g. bimanual tasks.

ACKNOWLEDGMENT

The TITAN Xp used for this research was donated by the NVIDIA Corporation.

Bibliography29

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] J. Peters, D. D. Lee, J. Kober, D. Nguyen-Tuong, J. A. Bagnell, and S. Schaal, “Robot learning,” in Handbook of Robotics , B. Siciliano and O. Khatib, Eds. Springer, 2016, ch. 15, pp. 357–398, 2nd Edition.
2[2] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” in International Conference on Learning Representation (ICLR) , 2016.
3[3] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” Journal of Machine Learning Research , vol. 17, no. 1, pp. 1334–1373, 2016.
4[4] O. Sigaud and F. Stulp, “Policy search in continuous action domains: an overview,” ar Xiv preprint ar Xiv:1803.04706 , 2018.
5[5] C. Daniel, G. Neumann, O. Kroemer, and J. Peters, “Hierarchical relative entropy policy search,” Journal of Machine Learning Research , vol. 17, no. 93, pp. 1–50, 2016.
6[6] K. Mülling, J. Kober, and J. Peters, “Learning table tennis with a mixture of motor primitives,” in IEEE-RAS International Conference on Humanoid Robots (Humanoids) , 2010, pp. 411–416.
7[7] T. Osa, J. Peters, and G. Neumann, “Hierarchical reinforcement learning of multiple grasping strategies with human instructions,” Advanced Robotics , vol. 32, no. 18, pp. 955–968, 2018.
8[8] T. Haarnoja, V. Pong, A. Zhou, M. Dalal, P. Abbeel, and S. Levine, “Composable deep reinforcement learning for robotic manipulation,” in IEEE International Conference on Robotics and Automation (ICRA) , 2018, pp. 6244–6251.