MaMiC: Macro and Micro Curriculum for Robotic Reinforcement Learning

Manan Tomar; Akhil Sathuluri; Balaraman Ravindran

arXiv:1905.07193·cs.LG·May 20, 2019


TL;DR

MaMiC introduces a dual curriculum approach combining macro and micro strategies to improve robotic manipulation learning with sparse rewards, reducing exploration challenges without complex reward engineering.

Contribution

This work presents a novel dual curriculum scheme for robotic reinforcement learning, integrating macro and micro curricula to enhance learning efficiency and task decomposition.

Findings

01 Improved success rates on Fetch environments.

02 Effective handling of sparse rewards without complex reward shaping.

03 Demonstrated independent utility of macro and micro curricula.

Abstract

Shaping in humans and animals has been shown to be a powerful tool for learning complex tasks as compared to learning in a randomized fashion. This makes the problem less complex and enables one to solve the easier sub-task at hand first. Generating a curriculum for such guided learning involves subjecting the agent to easier goals first, and then gradually increasing their difficulty. This paper takes a similar direction and proposes a dual curriculum scheme for solving robotic manipulation tasks with sparse rewards, called MaMiC. It includes a macro curriculum scheme which divides the task into multiple sub-tasks followed by a micro curriculum scheme which enables the agent to learn between such discovered sub-tasks. We show how combining macro and micro curriculum strategies helps in overcoming major exploratory constraints considered in robot manipulation tasks without having to…

Equations

$$\delta_{TD} = r_{t} + \gamma Q^{\prime}(s_{t+1}, \pi(s_{t+1})) - Q(s_{t}, a_{t})$$

$$\min_{D} V(D) = \mathbb{E}_{g \sim p_{data}(g)}\left[(1-\alpha)\,(D(g_{achieved}) - 1)^{2} + \alpha\,(D(g_{desired}) - 1)^{2}\right] + \mathbb{E}_{z \sim p_{z}(z)}\left[D(G(z))^{2}\right]$$

$$\min_{G} V(G) = \mathbb{E}_{z \sim p_{z}(z)}\left[(D(G(z)) - 1)^{2}\right]$$

$$r_{dense} = \lVert g_{achieved} - g_{desired} \rVert^{2}$$

$$r = \begin{cases} 0, & \text{if receptor on and distance to goal} < r_{threshold} \\ -1, & \text{otherwise} \end{cases}$$


Full text

Proc. of the 18th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2019), N. Agmon, M. E. Taylor, E. Elkind, M. Veloso (eds.), May 2019, Montreal, Canada

1Indian Institute of Technology Madras, Chennai, India

2Robert Bosch Center for Data Science and AI (RBCDSAI), Chennai, India

Manan Tomar1, Akhil Sathuluri1, Balaraman Ravindran1,2

Abstract.

Shaping in humans and animals has been shown to be a powerful tool for learning complex tasks as compared to learning in a randomized fashion. This makes the problem less complex and enables one to solve the easier sub-task at hand first. Generating a curriculum for such guided learning involves subjecting the agent to easier goals first, and then gradually increasing their difficulty. This paper takes a similar direction and proposes a dual curriculum scheme for solving robotic manipulation tasks with sparse rewards, called MaMiC. It includes a macro curriculum scheme which divides the task into multiple sub-tasks followed by a micro curriculum scheme which enables the agent to learn between such discovered sub-tasks. We show how combining macro and micro curriculum strategies helps in overcoming major exploratory constraints considered in robot manipulation tasks without having to engineer any complex rewards. We also illustrate the meaning of the individual curricula and how they can be used independently based on the task. The performance of such a dual curriculum scheme is analyzed on the Fetch environments.

Key words and phrases:

Reinforcement Learning; Curriculum Learning

1. Introduction

In recent years, deep Reinforcement Learning has produced promising results in varied domains such as game-playing Mnih et al. (2015), Silver et al. (2016), and continuous control Lillicrap et al. (2016), Schulman et al. (2015), Gu et al. (2017). Despite these developments, robotic decision making remains a hard problem when only minimal context about the task at hand is available Deisenroth et al. (2013). Robotic learning is challenging mainly because of complex dynamics, sparse rewards and exploration issues arising from large continuous state spaces, and thus provides a good testbed for reinforcement learning algorithms.

Solving complex tasks requires exploiting the structure of the task efficiently. Each task can be viewed as a combination of much simpler prelearnt skills. Consider the cases of unscrewing a bottle or placing an object in a drawer. All such everyday tasks involve reusing distinct skills or sub-policies in an intelligent manner to achieve the overall objective. To be able to solve such complex tasks it is important that we learn in an organized, meaningful manner rather than from data collected in a random fashion. Curriculum learning Bengio et al. (2009), Sukhbaatar et al. (2017) is a powerful concept that allows us to come up with such training strategies. Learning simpler tasks first and then using the acquired knowledge to learn progressively harder tasks is a natural outcome of formulating a curriculum. A curriculum helps the agent overcome exploratory constraints by focusing learning on simpler parts of the state space first. Recently, curriculum learning has been used to solve complex robotic tasks (not necessarily manipulation) such as in Florensa et al. (2017b), Nair et al. (2017). However, these approaches assume that the agent can be reset to any desired state, and also make use of expert state-action trajectories Nair et al. (2017), which are expensive to generate. Unlike such techniques, our method is not restricted by the ability to reset. Moreover, we use state-only demonstration sequences for learning only in specific tasks, and do not use demonstrations at all for the other tasks, thus distinguishing our work from those in the imitation learning sphere. Although learning from only state or observation sequences is much more difficult Liu et al. (2018), it offers practical benefits in terms of reduced trajectory collection costs and ease of implementation, and therefore fits our problem domain better.

One way of looking at the problem at hand is to extract sub-goals for a given task, learn sub-policies or skills that achieve these sub-goals, and then execute them in the right order. Such a top-down approach allows exploiting the structure of the problem, since the extracted sub-goals define the nature of the solution. Moreover, we also focus on the sequential nature of the problem, i.e. solving to achieve the first sub-goal, then the second sub-goal and so on. This is important as most robotic locomotion or manipulation problems can be decomposed in this manner. In our method, sub-goal extraction and sequencing are managed by the macro scheme, while learning each sub-policy is managed by the micro scheme. To achieve this, both schemes use concepts from curriculum learning.

We introduce MaMiC, comprising macro and micro curriculum, which can be applied either individually or in combination. A micro curriculum essentially generates increasingly complex goals for the agent to achieve. For example, in learning to push a block, initial goals will be generated very near to the block and then slowly shifted to the desired location. However, such a scheme is not sufficient for more complex tasks, such as ones which require the agent to follow a particular sequence of sub-policies. For instance, in order to put an object in a drawer, it is not enough to guide the agent in learning to move the object to the desired location; the agent must also open the drawer first. Only when a particular sequence of such sub-policies is followed do we consider the task completed. A macro curriculum helps in identifying such a sequence and allows the micro scheme to learn in between the steps of this sequence. A policy starts from an achieved sub-goal and proceeds to the next sub-goal, evolving in the process, ultimately reaching the actual goal. Two ideas are at the core of this technique: discovering the sub-goals, and learning between the discovered sub-goals. Fig 1 illustrates how MaMiC works. To summarize, the following are the major contributions of the paper:

• We propose a dual curriculum strategy comprising micro and macro schemes, which enables an agent to discover sub-goals and learn a policy which evolves to achieve such sub-goals sequentially, eventually solving the task.

• We analyze the macro and micro schemes individually, and illustrate how to combine these individual schemes with base reinforcement learning algorithms such as Deep Deterministic Policy Gradients (DDPG) to solve a given task.

• The performance of the proposed dual curriculum scheme is tested in a Receptor-PickandPlace environment and also in a custom physics environment.

• An industrial robot with minimal observations available is considered for training and the learnt policy is deployed onto a physical robot as validation (see the supplementary videos at https://goo.gl/nKZoCQ).

2. Background and Preliminaries

Reinforcement Learning (RL), Sutton and Barto (1998), considers the interaction of an agent with a given environment and is modeled by a Markov Decision Process (MDP), defined by the tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \rho_{0}, r \rangle$, where $\mathcal{S}$ defines the set of states, $\mathcal{A}$ the set of actions, $\mathcal{P}: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$ the transition function, $\rho_{0}$ the probability distribution over initial states, and $r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ the reward function. A policy is denoted by $\pi(s): \mathcal{S} \rightarrow \mathcal{P}(\mathcal{A})$, where $\mathcal{P}(\mathcal{A})$ defines a probability distribution over actions $a \in \mathcal{A}$ in a state $s \in \mathcal{S}$. The objective is to learn a policy such that the return $R_{t} = \sum_{i=t}^{T} \gamma^{(i-t)}\, r(s_{i}, a_{i})$ is maximized, where $r(s_{i}, a_{i})$ is the reward function and $\gamma$ is the discount factor.
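As a small worked example of the return defined above, the following sketch evaluates $R_{t}$ for a finite list of rewards. The function name `discounted_return` and the toy reward sequence are illustrative assumptions, not part of the original work.

```python
def discounted_return(rewards, gamma, t=0):
    """R_t = sum_{i=t}^{T} gamma^(i - t) * r_i for a finite reward list."""
    return sum(gamma ** (i - t) * r for i, r in enumerate(rewards) if i >= t)

# Toy sparse-reward episode (assumed values): -1 on every step until the
# goal is reached on the final step, which yields 0.
rewards = [-1.0, -1.0, -1.0, 0.0]

print(discounted_return(rewards, gamma=0.5))       # -1.75
print(discounted_return(rewards, gamma=0.5, t=2))  # -1.0
```

With a sparse $-1/0$ reward, the return is simply a discounted count of the steps taken before success, which is why reaching the goal earlier gives a higher return.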

2.1. Deep Deterministic Policy Gradients (DDPG)

DDPG Lillicrap et al. (2016) is an off-policy, model-free, actor-critic reinforcement learning method. The critic is used to estimate the action value function $Q(s_{t},a_{t})$, while the actor refers to the deterministic policy of the agent. The critic is learned by minimizing the standard TD error

$$\delta_{TD} = r_{t} + \gamma Q^{\prime}(s_{t+1}, \pi(s_{t+1})) - Q(s_{t}, a_{t}),$$

where $Q^{\prime}$ refers to a target network Mnih et al. (2015) which is updated after a fixed number of time steps. The actor is optimized by following the gradient of the critic's estimate of the $Q$ value. Universal Value Function Approximators (UVFA) Schaul et al. (2015) parameterize the $Q$ value function by the goal and learn a policy $\pi(s_{t},g_{t}): \mathcal{S} \times \mathcal{G} \rightarrow \mathcal{A}$ that depends on the goal as well. Such a value function is denoted by $Q(s,a,g)$.
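The TD error used to train the critic can be sketched in a few lines. This is a minimal illustration only: the one-dimensional `policy`, `critic` and `target_critic` functions below are hypothetical stand-ins for the neural networks, and the numeric values are arbitrary.

```python
def td_error(r_t, gamma, q_sa, q_target_next):
    """delta_TD = r_t + gamma * Q'(s_{t+1}, pi(s_{t+1})) - Q(s_t, a_t)."""
    return r_t + gamma * q_target_next - q_sa

# Hypothetical 1-D stand-ins for the actor and the two critics.
def policy(s):              # deterministic actor pi(s)
    return -0.5 * s

def critic(s, a):           # online critic Q(s, a)
    return -(s + a) ** 2

def target_critic(s, a):    # target critic Q'(s, a), a lagged copy
    return -0.9 * (s + a) ** 2

s_t, a_t, r_t, s_next, gamma = 1.0, -0.5, -1.0, 0.5, 0.98
delta = td_error(r_t, gamma,
                 q_sa=critic(s_t, a_t),
                 q_target_next=target_critic(s_next, policy(s_next)))
critic_loss = delta ** 2    # squared TD error for a single transition
print(delta, critic_loss)
```

In the goal-conditioned (UVFA) setting the same expression applies with the goal appended to every input, i.e. $Q(s,a,g)$ and $\pi(s,g)$.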

2.2. Hindsight Experience Replay

Hindsight Experience Replay (HER) was introduced by Andrychowicz et al. (2017) and works along with an off-policy method such as DDPG to accelerate the learning process. The overall idea is to learn from unsuccessful trials as well by parameterizing over goals. HER accelerates learning by replacing the actual goal in some samples with a goal that was achieved during the trajectory. Since the current policy is able to reach these achieved goals, learning the mapping from goals to actions becomes faster.
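The relabelling idea can be sketched as follows. This is an assumption-laden illustration: the dictionary layout of a transition, the parameter `k`, and the choice of sampling the substitute goal from the same or a later step of the same episode are illustrative choices, and the $0/-1$ reward with a 0.05 threshold follows the sparse reward used later in the paper.

```python
import math
import random

def sparse_reward(achieved, goal, threshold=0.05):
    """0 if the achieved goal is within `threshold` of the goal, else -1."""
    return 0.0 if math.dist(achieved, goal) < threshold else -1.0

def her_relabel(episode, k=4, rng=random):
    """Return the original transitions plus k hindsight copies of each.

    `episode` is a list of dicts with keys: s, a, s_next, achieved_next, goal.
    Each hindsight copy swaps the goal for one achieved at the same or a
    later step of the same episode, then recomputes the sparse reward.
    """
    out = []
    for t, tr in enumerate(episode):
        out.append(dict(tr, r=sparse_reward(tr["achieved_next"], tr["goal"])))
        for _ in range(k):
            future = rng.randint(t, len(episode) - 1)
            new_goal = episode[future]["achieved_next"]
            out.append(dict(tr, goal=new_goal,
                            r=sparse_reward(tr["achieved_next"], new_goal)))
    return out

# Toy episode: a point moving along x with a desired goal it never reaches.
episode = [
    {"s": (0.1 * i, 0.0), "a": (0.1, 0.0), "s_next": (0.1 * (i + 1), 0.0),
     "achieved_next": (0.1 * (i + 1), 0.0), "goal": (5.0, 0.0)}
    for i in range(3)
]
relabelled = her_relabel(episode, k=2, rng=random.Random(0))
print(len(relabelled))  # 9
```

Every original transition in the toy episode has reward $-1$, yet some relabelled copies receive reward $0$, which gives the critic a non-trivial learning signal.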

2.3. Goal Generative Adversarial Network

Held et al. (2017) propose using a Generative Adversarial Network (GAN) Goodfellow et al. (2014), Mao et al. (2017) based goal generator for sampling good goals, i.e. goals which are neither too hard nor too easy for the current policy to achieve. The goals used for training the GAN are labeled based on the return obtained for the specific goal. Goals which lead to a positive return are encouraged while those which lead to a negative return are discouraged.

2.4. Definitions

The following are the definitions of the terms used throughout the paper:

Desired Goals : These refer to the actual goals received from the environment and correspond to the task being solved.

Achieved Goals : These refer to end of trajectory states achieved by the agent while following the currently learned policy.

HER Goals : These refer to achieved states in a trajectory while following the currently learned policy, randomly sampled as is proposed by HER.

Micro Goals: These refer to goals generated by the goal generator.

Sub Goals: These refer to the sub-goals extracted from demonstrations or assumed to be given by an oracle.

This work assumes that there exists a mapping $m(g): G \rightarrow S$ between a goal $g \in G$ and a state $s \in S$. The task then is defined by achieving the corresponding goal state $s_{g}$ for a given goal $g$. Note that if such a mapping exists, a goal can be achieved by achieving more than one state. Many robotic manipulation tasks are designed such that the goal can be represented as an achievable state, and therefore, such an assumption does not add extreme constraints. In such cases, the achieved goal can be the object's position and the desired goal can be the target location. Note that the framework adopted in this work does not limit us to only have Cartesian coordinates of objects for defining an achieved goal.

Assumptions about the environment limit how well an algorithm generalizes and lead to failure in unconstrained settings or real-world deployment. In manipulation tasks these can be alleviated by breaking the task into much simpler tasks with fewer constraints, making it easier for the agent to learn. Below are two such assumptions:

• Resetting the agent to any desired state : In the native reinforcement learning setting, the agent is initialized at particular states based on a start state distribution available only to the environment. However, as mentioned, previous works assume that the agent can be initialized from whichever state is desired. Given this assumption, the agent can start directly from the goal state and thus not learn at all. Such an assumption is extremely limiting as in any practical setting the environment dictates the start state of the agent. The agent should be intelligent enough to reach such desired or favorable states.

• Starting from solved or partially solved states : Prior work also mentions another technique for learning sparse reward manipulation tasks which involves starting some training trajectories from solved states and the rest by sampling from the start state distribution. For example, in a pushing task, some trajectories start with the object being placed at the target location.

3. Micro Curriculum

A micro curriculum tries to alleviate the above-mentioned assumption of being able to start some trajectories from favorable states. As argued above, we believe that the start state should be chosen by the environment and not by the agent. We propose replacing all or some transition sample goals with micro goals, which may be generated by any generative modeling technique. Using an off-policy RL algorithm allows us to replace sampled transition goals from the buffer with micro goals. The goals are generated such that they are initially close to the achieved states at the end of each trajectory (i.e. the achieved goal distribution) and slowly shift to being closer to the actual or desired goal distribution of the task at hand. Since this procedure involves learning a mapping between goals and actions, eventually the agent is able to generalize well for the actual goal distribution. We relate this to curriculum learning because the agent initially learns for a goal distribution that is much simpler to learn, i.e. the achieved goal distribution, and then continues learning for increasingly difficult goals, leveraging the previously learned skills.

To train the goal generator, we make use of Generative Adversarial Networks or GANs Goodfellow et al. (2014) and modify the formulation used by Held et al. (2017). We incorporate an additional parameter $\alpha \in [0,1]$ which governs the resemblance of the generated distribution to the achieved goal distribution and the actual or desired goal distribution. $\alpha=0$ forces the generator to produce goals similar to the currently achieved states, while $\alpha=1$ produces goals similar to the actual distribution. The exact objective function is given below.

$$\min_{D} V(D) = \mathbb{E}_{g \sim p_{data}(g)}\left[(1-\alpha)\,(D(g_{achieved}) - 1)^{2} + \alpha\,(D(g_{desired}) - 1)^{2}\right] + \mathbb{E}_{z \sim p_{z}(z)}\left[D(G(z))^{2}\right]$$

$$\min_{G} V(G) = \mathbb{E}_{z \sim p_{z}(z)}\left[(D(G(z)) - 1)^{2}\right],$$

where $D$ denotes the discriminator network, $G$ the generator network, and $V$ the GAN value function. $p_{z}$ here is taken as a uniform distribution between 0 and 1 from which the noise vector $z$ is sampled. In all experiments that follow, we update $\alpha$ if the success rate of the currently learned policy for goals generated by the GAN lies above a particular threshold consistently for a few epochs. This essentially tells us that the policy has now mastered achieving the currently generated goals with some degree of confidence and thus the GAN can now shift further towards producing goals resembling the desired distribution.
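The least-squares objectives and the $\alpha$ update rule described above can be sketched on plain lists of discriminator outputs. The function names are illustrative, and the values of `threshold`, `patience` and `step` in `update_alpha` are assumptions, since the text only states that $\alpha$ is updated when the success rate stays above a threshold for a few epochs.

```python
def discriminator_loss(d_achieved, d_desired, d_generated, alpha):
    """V(D): alpha-weighted least-squares loss on real goals plus fake term."""
    real = [(1 - alpha) * (da - 1) ** 2 + alpha * (dd - 1) ** 2
            for da, dd in zip(d_achieved, d_desired)]
    fake = [dg ** 2 for dg in d_generated]
    return sum(real) / len(real) + sum(fake) / len(fake)

def generator_loss(d_generated):
    """V(G): push discriminator outputs on generated goals towards 1."""
    return sum((dg - 1) ** 2 for dg in d_generated) / len(d_generated)

def update_alpha(alpha, success_rates, threshold=0.8, patience=3, step=0.1):
    """Raise alpha when the last `patience` epochs all exceed `threshold`.

    threshold, patience and step are hypothetical values for illustration.
    """
    window = success_rates[-patience:]
    if len(window) == patience and all(s > threshold for s in window):
        return min(1.0, alpha + step)
    return alpha

# With alpha = 0 only the achieved goals count as "real"; with alpha = 1
# only the desired goals do.
print(discriminator_loss([1.0], [0.0], [0.0], alpha=0.0))  # 0.0
print(discriminator_loss([1.0], [0.0], [0.0], alpha=1.0))  # 1.0
print(update_alpha(0.0, [0.9, 0.9, 0.9]))                  # 0.1
```

Intermediate values of $\alpha$ make the discriminator treat a blend of both goal sets as real, so the generator is pulled gradually from the achieved distribution towards the desired one.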

Algorithm 1 : Micro Curriculum
Given : An off-policy RL algorithm $A$, a goal generator $G$, a goal sampling strategy $S$, replay buffer $R$
Initialize $A$, $R$, $G$
$\textbf{for}\ n=1,...,N$ episodes do
    Sample initial state $s_{0}$, goal $g$
    $\texttt{DesiredGoal}_{n}\leftarrow g$
    Generate artificial goal from $G$: $g_{micro}\leftarrow G(z)$, $z\sim p_{z}$
    $\textbf{for}\ t=0,...,T-1$ steps do
        Compute $a_{t}$ from behavioral policy, $a_{t}\leftarrow\pi_{b}(s_{t},g_{micro})$
        Execute $a_{t}$, observe next state $s_{t+1}$ and compute reward $r_{t}=r(s_{t+1},g_{micro})$
        Store transition $(s_{t},a_{t},r_{t},s_{t+1},g_{micro})$ in $R$
    end for
    $\texttt{AchievedGoal}_{n}\leftarrow s_{T}$
    Sample a random minibatch of $N$ transitions $(s_{i},a_{i},r_{i},s_{i+1})$ from $R$
    Sample new goals $g^{\prime}$ using $S$
    Replace the goals of the sampled transitions with the new goal $g^{\prime}$, giving $(s_{i},a_{i},r_{i},s_{i+1},g^{\prime})$
    Recompute reward for replaced goals
    $\textbf{for}\ i=1,...,K$ iterations do
        Perform one optimization step for Goal Generator using (AchievedGoal, DesiredGoal)
    end for
    $\textbf{for}\ i=1,...,M$ iterations do
        Perform one optimization step of $A$
    end for
end for

Algorithm 1 describes our method in detail. At each iteration, the goal generator produces a micro goal which is used to condition the behavior policy and collect samples by executing it. For each episode, the end-of-trajectory state, called the achieved goal, is collected and stored in memory. While training, a mini batch of data is sampled and some or all of the goal samples are relabelled with new ones using the goal sampling strategy (described below). The achieved goals and the desired goals are used to update the goal generator periodically. The desired goals are either the goals corresponding to the task at hand or any of the sub-goals provided by the sub-goal extraction method. Therefore, this allows the micro scheme to be run independently as well as in combination with the macro method. We elaborate more on this in the section below.

3.1. Strategy for goal sampling

For replacing goals by sampling new ones, we consider different strategies such as having a mixture of HER goals and micro goals (referred to as micro-g), and having a mixture of HER goals and desired goals (referred to as micro-sg).
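The two mixtures can be sketched as a single relabelling routine that differs only in the pool of non-HER goals it draws from. The transition layout, the `her_goal` field and the mixing ratio `her_fraction` are assumptions for illustration; the text does not specify the proportions used.

```python
import math
import random

def sparse_reward(achieved, goal, threshold=0.05):
    """0 if the achieved goal is within `threshold` of the goal, else -1."""
    return 0.0 if math.dist(achieved, goal) < threshold else -1.0

def relabel_batch(batch, other_goals, her_fraction=0.8, rng=random):
    """Relabel each transition with a HER goal or a goal from `other_goals`.

    Pass micro goals as `other_goals` for micro-g, or desired goals for
    micro-sg. `her_fraction` is a hypothetical mixing ratio.
    """
    relabelled = []
    for tr in batch:
        if rng.random() < her_fraction:
            new_goal = tr["her_goal"]
        else:
            new_goal = rng.choice(other_goals)
        # The reward must be recomputed for the substituted goal.
        reward = sparse_reward(tr["achieved_next"], new_goal)
        relabelled.append(dict(tr, goal=new_goal, r=reward))
    return relabelled

batch = [{"achieved_next": (0.0, 0.0), "her_goal": (0.0, 0.0),
          "goal": (1.0, 1.0)}]
micro_goals = [(0.2, 0.0), (0.3, 0.1)]   # would come from the goal generator
desired_goals = [(1.0, 1.0)]             # goals issued by the environment

micro_g_batch = relabel_batch(batch, micro_goals)
micro_sg_batch = relabel_batch(batch, desired_goals)
```

Because relabelling happens at sampling time, the replay buffer itself is untouched and the mixture can change as the goal generator evolves.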

3.2. Environment Details

• Pushing : This requires a block placed on a table to be pushed by the end-effector of the robot to a given target.

• Sliding : In this task, the robot is supposed to hit a puck so that the puck reaches a target location. The target location lies out of the reach of the end-effector, which prevents the robot from pushing the puck continuously towards the target. Instead, the agent needs to learn to solve the task with a single hit. Overall, we observe that although it is possible to learn a good policy, it is very hard to produce a perfect policy. This can possibly be attributed to the design of the task itself, or to the use of a very small $r_{threshold}$ in calculating the reward for such a hard task.

• Pick and Place : This requires the robot agent to pick a box lying on the table and place it at a target location in the air. The gripper is also controlled by the policy in this case, unlike the previous ones. We also do not start any episode with the block already in the robot's gripper, thus making sure that favorable starts are not considered. Specifically for this task, we consider two sampling strategies for the target location. The uniform strategy samples the target location in the air completely at random, without prioritizing the table. The non-uniform strategy samples the target on the table with probability $0.5$ and in the air with probability $0.5$.
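The two target sampling strategies for Pick and Place can be sketched as follows. All coordinates (`TABLE_Z`, `AIR_RANGE`, `XY_RANGE`) are hypothetical placeholders rather than values from the Fetch simulation; only the 0.5/0.5 split of the non-uniform strategy comes from the text.

```python
import random

TABLE_Z = 0.42            # hypothetical table-top height in metres
AIR_RANGE = (0.05, 0.45)  # hypothetical height range above the table
XY_RANGE = 0.15           # hypothetical half-width of the target region

def sample_target(strategy, rng=random):
    """Sample an (x, y, z) target for Pick and Place."""
    x = rng.uniform(-XY_RANGE, XY_RANGE)
    y = rng.uniform(-XY_RANGE, XY_RANGE)
    if strategy == "uniform":
        in_air = True                  # always above the table
    elif strategy == "non-uniform":
        in_air = rng.random() >= 0.5   # table w.p. 0.5, air w.p. 0.5
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    if in_air:
        z = TABLE_Z + rng.uniform(*AIR_RANGE)
    else:
        z = TABLE_Z
    return (x, y, z)

rng = random.Random(0)
uniform_targets = [sample_target("uniform", rng) for _ in range(200)]
mixed_targets = [sample_target("non-uniform", rng) for _ in range(2000)]
on_table = sum(1 for t in mixed_targets if t[2] == TABLE_Z) / len(mixed_targets)
print(round(on_table, 2))
```

Targets on the table can be reached by pushing alone, so the non-uniform strategy gives the agent easier goals to succeed on, whereas the uniform strategy always requires a successful grasp and lift.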

3.3. Training Details

3.3.1. Goal Generator

We train a GAN on the achieved and desired goal data gathered after each rollout. The generator network consists of two 128-node layers, while the discriminator consists of two 256-node layers. We use a learning rate of 0.001, a batch size of 64, and sample a noise vector $z$ of size 4. We run 200 training iterations of the GAN after every 100 iterations of the DDPG policy.

3.3.2. Sub-goal Extractor

For learning a mapping between start states and sub-goals, we train a 2-layer MLP with 16 nodes per layer. The input is the start state and the output is the sub-goal, i.e. a vector of size 3. The batch size is 64 and the learning rate is 0.001. We run 1000 training iterations of this extractor on a dataset consisting of 1000 expert trajectory samples. We observe that using fewer expert trajectories, i.e. around 200, does not affect the accuracy much.

3.3.3. Architecture

We run all experiments for 150 epochs on 5 CPU cores. Each epoch consists of 50 cycles. For each cycle, 40 training iterations of DDPG are performed. Both the Actor and Critic networks in DDPG are 3-layer MLPs with ReLU non-linearities, 256 nodes per layer and a learning rate of $10^{-3}$.
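As a rough check on the training budget implied by these numbers, the arithmetic below counts the total number of optimization steps. It assumes that the "100 iterations of the DDPG policy" between GAN updates refers to the same DDPG training iterations counted per cycle, which the text does not state explicitly; the constant names are illustrative.

```python
EPOCHS = 150
CYCLES_PER_EPOCH = 50
DDPG_ITERS_PER_CYCLE = 40

GAN_ITERS_PER_UPDATE = 200            # GAN iterations per update round
DDPG_ITERS_BETWEEN_GAN_UPDATES = 100  # assumed to count DDPG training iterations

total_ddpg_iters = EPOCHS * CYCLES_PER_EPOCH * DDPG_ITERS_PER_CYCLE
gan_update_rounds = total_ddpg_iters // DDPG_ITERS_BETWEEN_GAN_UPDATES
total_gan_iters = gan_update_rounds * GAN_ITERS_PER_UPDATE

print(total_ddpg_iters)   # 300000
print(gan_update_rounds)  # 3000
print(total_gan_iters)    # 600000
```

Under this reading the goal generator receives twice as many gradient steps as the policy, which is plausible given how small the GAN networks are compared with the actor and critic.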

3.4. Micro - Tasks Considered

We consider variants of the pushing, sliding and pick and place tasks for a 7 DOF Fetch robot simulation Plappert et al. (2018) as shown in Fig 3. The sampling strategy $S$ for micro used here comprises HER goals and micro goal samples. We consider three tasks in the Mujoco Todorov et al. (2012) environment for our experiments as described below. A successful trajectory receives a reward of $0$ while an unsuccessful one receives a reward of $-1$. For all three tasks, the target and the object are randomly initialized such that they do not lie within the reward threshold $r_{threshold}=0.05$, equivalent to 5cm, of each other, and therefore the reward received initially is always $-1$, i.e. we make sure that the agent does not start from a solved state even by chance. We compare our method with the original HER algorithm proposed in Andrychowicz et al. (2017), which is the state-of-the-art algorithm on these domains. Moreover, we also compare with the original DDPG algorithm as a baseline. However, since DDPG on its own fails to solve any of the tasks considered in this paper (success rate of almost 0 across all training epochs), we opt to not show these results explicitly in the plots.
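The sparse reward and the "never start solved" initialization can be sketched as follows. The sampling box used in `sample_unsolved_start` and the function names are assumptions for illustration; only the $0/-1$ reward and the 5cm threshold come from the text.

```python
import math
import random

R_THRESHOLD = 0.05  # 5 cm, as stated in the text

def sparse_reward(achieved_goal, desired_goal, threshold=R_THRESHOLD):
    """0 when the object is within `threshold` of the target, else -1."""
    return 0.0 if math.dist(achieved_goal, desired_goal) < threshold else -1.0

def sample_unsolved_start(rng=random, half_width=0.15):
    """Rejection-sample object and target so the episode never starts solved.

    The square sampling region of size 2 * half_width is hypothetical.
    """
    while True:
        obj = (rng.uniform(-half_width, half_width),
               rng.uniform(-half_width, half_width))
        target = (rng.uniform(-half_width, half_width),
                  rng.uniform(-half_width, half_width))
        if sparse_reward(obj, target) == -1.0:
            return obj, target

print(sparse_reward((0.0, 0.0, 0.0), (0.03, 0.0, 0.0)))  # 0.0
print(sparse_reward((0.0, 0.0, 0.0), (0.06, 0.0, 0.0)))  # -1.0
obj, target = sample_unsolved_start(random.Random(0))
```

Rejecting solved starts guarantees that the first reward of every episode is $-1$, so any success must come from the policy's own actions.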

3.4.1. Push-hard and Slide-hard tasks

We consider harder variants of the pushing and sliding tasks for testing the micro scheme. These tasks are "made hard" by ensuring that the object and the target are initially drawn from different distributions and are far apart from each other in every episode. This makes the task difficult to solve: even if the agent somehow learns to push or slide the object to some nearby target site, the task is still not considered solved.

3.4.2. Pick and Place

The task requires an object to be picked and placed at a target site. The target is never sampled on the table and always in the air. We also do not start any episode with the block already in the robot’s gripper, thus making sure that favorable starts are not considered.

3.5. Results

We are able to learn optimal policies for all three tasks. For push-hard and slide-hard tasks, HER is unable to even learn to reach the object as shown in Fig 3. This can be attributed to a mismatch in the kind of goals provided to the parameterized policy and the ones on which the agent learns off-policy. On the other hand, following the micro scheme, we are able to gradually start learning to reach and push / slide the object to nearby generated goals and then gain expertise with respect to the target goals. For Pick and Place, since the goal is always in the air and the object always on the table, a similar mismatch is conceivable.

4. Macro Curriculum

A macro curriculum scheme allows extracting sub-goals by leveraging demonstrated states or observations and sequentially learning the sub-policies for each sub-goal. In the experiments we consider, this implies that learning to achieve the second sub-goal is facilitated by leveraging previous learning of achieving the first sub-goal (learning to push uses already gathered information about learning to reach). We argue that this setting is general enough because each sub-policy itself learns a hard task (the task of reaching) instead of simple "macro" actions (moving the manipulator continuously in a particular direction). This allows representing the final task policy as comprising each sub-policy.

Specifically, we consider long horizon tasks and assume that a few demonstration state trajectories $\tau=s_{0},s_{1},...,s_{t}$ are available for the given tasks. In general, detecting changes in state representation has been shown to be a good method for extracting sub-goals, since system dynamics change suddenly around such sub-goals. In our case, the dense reward (eq. 4) computed per time step for a demonstration is used as the signal for sub-goal extraction. We compute the gradient ratio for such a signal and choose the sub-goal as the state for which consistent spikes are observed. Fig 4 shows the plots for such a dense reward signal in the three tasks considered. The intuition for finding a good sub-goal in a typical manipulation task is to look for a sudden change in the dynamics of the system. For example, if the robot is trying to push a block, then once the robot explores and starts to interact with the block, the policy will differ because the block interaction dynamics now also affect the reward. For demonstration trajectories, we observe that the gradient ratio of the dense reward always results in consistent spikes near the object position, indicating that it is a good sub-goal for learning the three tasks mentioned.

$$r_{dense} = \lVert g_{achieved} - g_{desired} \rVert^{2}$$

Learning between two such sub-goals can be performed by following the micro curriculum scheme detailed above. The extracted sub-goals form a set of states that are achieved by most of the sampled expert trajectories. Note that these sub-goals are dependent on the start state. This is because we consider learning over varied goals, thus using goal conditioned policies and not a single goal state. Having a policy $\pi(s_{t},sg_{t+1})$ that has already learnt to achieve a sub-goal $sg_{t}$ allows the agent to achieve the next sub-goal $sg_{t+1}$ by leveraging previous information.

Consider the example of robotic ant navigation where, to reach the goal state, the ant needs to collect a key which will open the door to the goal state room. The point we make here is that using only a micro scheme will generate goals between the ant's start position and the goal position. However, doing so will result in the ant always jamming against the door with no success in opening it. Since the key lies along another path, through which no micro goals are generated, the agent never learns to open the door. This is where observing an expert and using it to learn that a sub-goal lies at the key location becomes relevant. Following this, a micro scheme can be used to learn each sub-policy: that of reaching the key from the start state, and that of reaching the actual goal state from the key location.

Method 1 : Extract sub-goals
Collect state demonstration trajectories $\tau$
Compute dense reward obtained at each time step $i$, $r_{dense}=(\texttt{AchievedGoal}_{i}-\texttt{DesiredGoal}_{i})^{2}$
Compute ratio of gradient of the dense reward, $r_{grad}$, for each state in an expert trajectory
$p\leftarrow$ Normalize $r_{grad}$ in [0, 1]
sub-goals $\leftarrow$ Sample num_subgoals states from each trajectory based on highest probability $p$
$\textbf{for}\ n=1,...,N$ iterations do
    Train sub-goal extractor $F$ (sub-goals, start_states)
end for
return $F$
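The extraction steps above can be sketched on a single toy demonstration. The text does not give a formula for the gradient ratio, so this sketch assumes it is the magnitude of the dense-reward change leaving a state divided by the magnitude of the change entering it; the `eps` constant, the deterministic top-k selection and the toy trajectory are likewise assumptions. Training the extractor $F$ on the resulting (start state, sub-goal) pairs is omitted.

```python
def dense_reward(achieved_goal, desired_goal):
    """Squared distance between achieved and desired goal."""
    return sum((a - d) ** 2 for a, d in zip(achieved_goal, desired_goal))

def gradient_ratio(signal, eps=1e-3):
    """Assumed definition: |change leaving t| / (|change entering t| + eps)."""
    diffs = [signal[i + 1] - signal[i] for i in range(len(signal) - 1)]
    ratios = [0.0] * len(signal)
    for t in range(1, len(diffs)):
        ratios[t] = abs(diffs[t]) / (abs(diffs[t - 1]) + eps)
    return ratios

def normalize(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def extract_subgoal_indices(signal, num_subgoals=1):
    """Indices of the states with the highest normalized gradient ratio."""
    p = normalize(gradient_ratio(signal))
    ranked = sorted(range(len(p)), key=lambda t: p[t], reverse=True)
    return sorted(ranked[:num_subgoals])

# Toy pushing demonstration: the object does not move until the gripper
# reaches it at step 3, after which it is pushed towards the target.
desired = (1.0, 0.0)
object_positions = [(0.0, 0.0)] * 4 + [(0.2, 0.0), (0.4, 0.0), (0.6, 0.0)]
signal = [dense_reward(p, desired) for p in object_positions]

print(extract_subgoal_indices(signal, num_subgoals=1))  # [3]
```

In this toy trajectory the spike appears at the last step before the object starts moving, i.e. when the gripper has reached the object, which matches the observation that sub-goals are found near the object position.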

4.1. Macro - Tasks Considered

4.1.1. Receptor-PickAndPlace task

We introduce a new task setting called Receptor-PickandPlace which comprises an object placed on a table, a receptor site on the table, and a target located in the air. As shown in Fig 5, the green and red markers represent the receptor and the goal locations respectively. The agent is required to pick and place the object at a target, which gets activated only if the object passes through the receptor site. Therefore, the agent is not rewarded even if the object is successfully placed at the target, if it has not passed through the receptor site. Such a task becomes extremely difficult to solve because of the sequencing behavior involved and the sparse reward available. We show how combining the macro and micro schemes can solve this task, by 1) leveraging demonstration states to extract a sub-goal near the receptor site and 2) using a powerful micro scheme to realize the sequencing of tasks involved, i.e. first moving the block to the receptor and then to the target.

$$r = \begin{cases} 0, & \text{if receptor on and distance to goal} < r_{threshold} \\ -1, & \text{otherwise} \end{cases}$$
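The receptor-gated reward of this task can be sketched as a reward function plus a latch that records whether the receptor has been visited. The activation radius used in `update_receptor` and the function names are assumptions; the text only states that the target is activated once the object passes through the receptor site.

```python
import math

R_THRESHOLD = 0.05  # same 5 cm threshold as in the other tasks

def update_receptor(receptor_on, object_pos, receptor_pos,
                    radius=R_THRESHOLD):
    """Latch the receptor on once the object comes within `radius` of it.

    Reusing R_THRESHOLD as the activation radius is an assumption.
    """
    return receptor_on or math.dist(object_pos, receptor_pos) < radius

def receptor_reward(object_pos, target_pos, receptor_on,
                    threshold=R_THRESHOLD):
    """0 only if the receptor is on and the object is at the target."""
    if receptor_on and math.dist(object_pos, target_pos) < threshold:
        return 0.0
    return -1.0

receptor = (0.2, 0.0, 0.0)
target = (0.0, 0.0, 0.3)

# Placing the object at the target without visiting the receptor fails.
print(receptor_reward(target, target, receptor_on=False))  # -1.0

# Visiting the receptor first activates the target.
on = update_receptor(False, (0.21, 0.0, 0.0), receptor)
print(receptor_reward(target, target, receptor_on=on))     # 0.0
```

The latch makes the reward depend on the history of the episode rather than only on the final object position, which is what creates the sequencing requirement.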

4.1.2. Push-far and Slide-far tasks

We also consider variants of the pushing and sliding tasks in which the start state (the gripper position) is considerably far from the table and varied as opposed to the default case where the gripper always starts from a single state and over the table.

4.2. Results

For the Receptor-PickandPlace task, recognizing the receptor as a sub-goal is crucial to learning. There is a significant peak in the dense reward gradient ratio around the receptor location, indicating that the sub-goal extraction in the macro scheme is able to leverage demonstrations efficiently. When combined with a micro scheme, this makes it possible to learn the sequence of first going to the receptor with the block, thus activating the target, and then placing the block at the target. HER and the micro scheme applied individually fail to learn this task, as shown in Fig 5, for different reasons. For HER, the task is too difficult because the target is quite far away and always sampled in the air. With a micro scheme alone, we initially see that the learned policy tries to pick and place the object at targets just above the table directly, without going over the receptor. However, since the agent is not rewarded for this, it quickly diverges to random behavior.

For the Push-far and Slide-far tasks, both MaMiC and HER learn useful policies. Since the object and target lie in overlapping distributions in these tasks, HER is able to perform well, as shown in Fig 6. However, HER exhibits extremely high variance, ranging from solving the task in some runs to learning no useful behavior at all in others. This can potentially be attributed to the fact that, since the gripper starts in a significantly different part of the state space from the block, learning is no longer as stable as when the gripper starts over the table and close to the block. MaMiC, on the other hand, first learns the reaching sub-task by identifying locations close to the object's position as good sub-goals, and then learns to push the object. MaMiC thus provides a clear acceleration in this case and is more stable than HER.

5. Related and Future Work

Konidaris and Barto (2009) exploit the idea of starting from states near the goal and then gradually expanding the starting distribution to learn the overall task. This works because the agent slowly starts to learn how to reach states which are close to the goal. Florensa et al. (2017b) build on this concept and propose a scheme for expanding the start state distribution based on the reward received while starting from such states. However, as mentioned, the usual assumption here is that the agent can reset to any state, which does not hold in general. Moreover, the experiments are shown on tasks with a single goal state, and therefore the policy does not generalize to a multiple goal domain such as pick and place. Extending this idea to goal parameterized policies is also far from trivial. Sukhbaatar et al. (2017) also propose an automatic curriculum generation scheme, but work on the assumption that the environment is either reversible or resettable.

Other works, such as McGovern and Barto (2001) and Şimşek et al. (2005), propose different methods for extracting sub-goals. On a higher level, given a sub-goal extraction technique and a function which maps goals to states, our method can work on domains other than robotic manipulation as well. A by-product of an evolving policy, as in our method, is that the sub-policies can be saved as learnt options (Sutton et al. (1999), Bacon et al. (2017)) and then used for transfer to tasks which are defined differently but require similar options. Similar ideas have been reported in Florensa et al. (2017a) and Da Silva et al. (2012), where the agent learns a set of skills in a pre-training procedure. Such skills are later combined with a master policy which allows for efficient exploration. These works mainly build on a bottom-up approach, which restricts the meta-policy required to solve complex tasks to comprise only pre-defined or pre-learnt options.

Since the setting of the algorithm is quite general, there are multiple directions for extending this work. The next challenge is to show how such a technique performs on even longer-horizon tasks, perhaps involving multiple objects as well. Working with image-based observations can allow for learning richer representations useful in sub-goal extraction. Moreover, collecting state or observation demonstration trajectories is relatively simpler and more intuitive with images. Considering better heuristics for how $\alpha$ is updated to produce goals closer to the DesiredGoal distribution is an important point to improve upon. Another avenue for future work is to incorporate different schemes of sub-goal extraction which exploit domain-specific properties.

6. Conclusion

We introduce a dual curriculum scheme which aids exploration in robotic manipulation tasks with very sparse rewards. We show how the micro scheme is a powerful method for generating goals intelligently and can allow solving hard variants of the pushing, sliding and pick and place tasks without resetting to arbitrary states, starting from favorable states or using expert actions. Moreover, through the Receptor-PickandPlace task, we emphasize the need for a macro scheme combined with the micro scheme when a task involves completing sub-tasks sequentially.

Bibliography

Andrychowicz et al. (2017) Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. 2017. Hindsight experience replay. In Advances in Neural Information Processing Systems. 5048–5058.
Bacon et al. (2017) Pierre-Luc Bacon, Jean Harb, and Doina Precup. 2017. The Option-Critic Architecture. In AAAI. 1726–1734.
Bengio et al. (2009) Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 41–48.
Da Silva et al. (2012) Bruno Da Silva, George Konidaris, and Andrew Barto. 2012. Learning parameterized skills. arXiv preprint arXiv:1206.6398 (2012).
Deisenroth et al. (2013) Marc Peter Deisenroth, Gerhard Neumann, Jan Peters, et al. 2013. A survey on policy search for robotics. Foundations and Trends® in Robotics 2, 1–2 (2013), 1–142.
Florensa et al. (2017a) Carlos Florensa, Yan Duan, and Pieter Abbeel. 2017a. Stochastic neural networks for hierarchical reinforcement learning. arXiv preprint arXiv:1704.03012 (2017).
Florensa et al. (2017b) Carlos Florensa, David Held, Markus Wulfmeier, and Pieter Abbeel. 2017b. Reverse curriculum generation for reinforcement learning. arXiv preprint arXiv:1707.05300 (2017).