# Hierarchical Reinforcement Learning for Concurrent Discovery of Compound and Composable Policies

**Authors:** Domingo Esteban, Leonel Rozo, Darwin G. Caldwell

arXiv: 1905.09668 · 2019-08-02

## TL;DR

This paper introduces a hierarchical reinforcement learning method that simultaneously learns compound and subtask policies using Gaussian policies and off-policy data, enabling concurrent subtask execution and efficient complex task learning.

## Contribution

The paper presents an algorithm that learns both the compound policy and the composable (subtask) policies within a single off-policy, maximum entropy RL process, so the subtask policies are learned concurrently with the complex task rather than beforehand.

## Key findings

- The method effectively solves complex tasks using experience from the compound policy.
- It learns useful subtask policies that perform well independently.
- Concurrent policy learning improves efficiency over sequential approaches.

## Abstract

A common strategy to deal with the expensive reinforcement learning (RL) of complex tasks is to decompose them into a collection of subtasks that are usually simpler to learn as well as reusable for new problems. However, when a robot learns the policies for these subtasks, common approaches treat every policy learning process separately. Therefore, all these individual (composable) policies need to be learned before tackling the learning process of the complex task through policy composition. Moreover, such composition of individual policies is usually performed sequentially, which is not suitable for tasks that require the subtasks to be performed concurrently. In this paper, we propose to combine a set of composable Gaussian policies corresponding to these subtasks using a set of activation vectors, resulting in a complex Gaussian policy that is a function of the means and covariance matrices of the composable policies. Moreover, we propose an algorithm for learning both compound and composable policies within the same learning process by exploiting the off-policy data generated from the compound policy. The algorithm is built on a maximum entropy RL approach to favor exploration during the learning process. The results of the experiments show that the experience collected with the compound policy makes it possible not only to solve the complex task but also to obtain useful composable policies that successfully perform in their corresponding subtasks.
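To make the idea of a compound Gaussian policy built from the means and covariances of composable policies more concrete, here is a minimal sketch. This summary does not give the paper's exact composition rule, so the code assumes one standard choice: an activation-weighted product of Gaussians with diagonal covariances, where each activation vector scales a policy's precision per action dimension. The function name `compose_gaussians` and the array layout are hypothetical and chosen only for illustration.

```python
import numpy as np


def compose_gaussians(means, variances, activations):
    """Combine K diagonal Gaussian policies into one Gaussian.

    ASSUMPTION: the composition is an activation-weighted product of
    Gaussians. This is an illustrative choice, not necessarily the exact
    rule used in the paper.

    All inputs have shape (K, D): K composable policies, D action dimensions.
    Each action dimension needs at least one strictly positive activation,
    otherwise the total precision is zero and the division below fails.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    activations = np.asarray(activations, dtype=float)

    # Per-policy precision (inverse variance), scaled by its activation.
    weighted_precisions = activations / variances          # shape (K, D)

    # Precisions add up under a product of Gaussians.
    total_precision = weighted_precisions.sum(axis=0)      # shape (D,)
    compound_variance = 1.0 / total_precision

    # The compound mean is the precision-weighted average of the means.
    compound_mean = compound_variance * (weighted_precisions * means).sum(axis=0)
    return compound_mean, compound_variance


# Two hypothetical subtask policies acting on a 2-dimensional action space.
subtask_means = [[0.0, 0.0], [2.0, 4.0]]
subtask_variances = [[1.0, 1.0], [1.0, 1.0]]

# Equal activations blend both policies in every action dimension.
blended_mean, blended_var = compose_gaussians(
    subtask_means, subtask_variances, [[1.0, 1.0], [1.0, 1.0]]
)

# One-hot activations hand each action dimension to a single policy.
split_mean, split_var = compose_gaussians(
    subtask_means, subtask_variances, [[1.0, 0.0], [0.0, 1.0]]
)
```

Under this assumption, activations close to zero effectively switch a subtask policy off for a given action dimension, while several nonzero activations let multiple subtasks influence the same action at once, which is the concurrent behavior the abstract describes.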

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1905.09668/full.md

## Figures

18 figures with captions in the complete paper: https://tomesphere.com/paper/1905.09668/full.md

## References

29 references — full list in the complete paper: https://tomesphere.com/paper/1905.09668/full.md

---
Source: https://tomesphere.com/paper/1905.09668