Autonomous Goal Exploration using Learned Goal Spaces for Visuomotor Skill Acquisition in Robots

Adrien Laversanne-Finot; Alexandre Péré; Pierre-Yves Oudeyer

arXiv:1906.03967·cs.LG·June 11, 2019


TL;DR

This paper demonstrates how robots can autonomously learn visuomotor skills by discovering goal spaces through deep learning, enabling efficient skill acquisition without human supervision in real-world settings.

Contribution

It introduces a method for autonomous goal exploration using learned goal spaces directly from robot experience, advancing lifelong learning in robotics.

Findings

01. Successful real-world robotic manipulation of a ball using learned goal spaces

02. Effective autonomous skill discovery without prior engineered features

03. Demonstrated applicability of deep representation learning in robotics


Equations

$$\log\mathcal{L}(\mathcal{D})=\sum_{i=1}^{N}\log p_{\theta}(\mathbf{x}^{i})$$

$$\mathcal{L}(\mathbf{x};\theta,\phi)=\mathbb{E}_{\mathbf{z}\sim q_{\phi}(\mathbf{z}\mid\mathbf{x})}\left[\log p_{\theta}(\mathbf{x}\mid\mathbf{z})\right]-\mathbb{D}_{KL}\left[q_{\phi}(\mathbf{z}\mid\mathbf{x})\,\|\,p(\mathbf{z})\right]$$



Adrien Laversanne-Finot, Alexandre Péré, Pierre-Yves Oudeyer

Flowers Team, Inria and Ensta-ParisTech, France

Abstract

The automatic and efficient discovery of skills, without supervision, for long-living autonomous agents remains a challenge of Artificial Intelligence. Intrinsically Motivated Goal Exploration Processes give learning agents a human-inspired mechanism to sequentially select goals to achieve. This approach gives a new perspective on the lifelong learning problem, with promising results in both simulated and real-world experiments. Until recently, those algorithms were restricted to domains with experimenter knowledge, since the Goal Space used by the agents was built on engineered feature extractors. Recent advances in deep representation learning enable new ways of designing those feature extractors, directly using the agent's experience. Recent work has shown the potential of those methods on simple yet challenging simulated domains. In this paper, we present recent results showing the applicability of those principles on a real-world robotic setup, where a 6-joint robotic arm learns to manipulate a ball inside an arena by choosing goals in a space learned from its past experience.

1 Introduction

Despite recent breakthroughs in artificial intelligence, learning agents often remain limited to tasks predefined by human engineers. The autonomous discovery and simultaneous learning of many tasks in an open world remains challenging for reinforcement learning algorithms. However, discovering autonomously the set of outcomes that can be produced by acting in an environment is of paramount importance for learning agents. This is essential to acquire world models and repertoires of parameterized skills (Baranes & Oudeyer, 2013; Da Silva et al., 2014; Hester & Stone, 2017) or to efficiently bootstrap exploration for deep reinforcement learning problems with rare or deceptive rewards (Conti et al., 2017; Colas et al., 2018b). In order to discover as many diverse outcomes as possible, the learner should be able to self-organize its exploration curriculum in order to discover efficiently the possible outcomes that can be produced in its environment.

When aiming to discover autonomously what outcomes can be produced by a physical robot, a naive exploration of the space of motor commands is bound to fail. First, the space of motor commands is often continuous and high-dimensional. Second, this space is also highly redundant: many motor commands will produce the same effect. Lastly, in any real-world setup, the number of samples that can be collected is limited. Thus, discovering diverse outcomes and learning policies to reproduce them requires more elaborate strategies.

One approach that was shown to be efficient in this context is known as Intrinsically Motivated Goal Exploration Processes (IMGEPs) (Baranes & Oudeyer, 2010; Forestier et al., 2017), an architecture closely related to Goal Babbling (Rolf et al., 2010). The general idea of IMGEPs is to equip the agent with a goal space. During exploration, the agent samples goals in this goal space according to a certain strategy, before trying to reach them using an associated goal-parameterized reward function. For each sampled goal, the agent dedicates a budget of experiments to improving its performance on this particular goal. Crucially, the agent stores each outcome discovered during exploration, which allows it to learn in hindsight how to achieve each outcome it discovers, should it later sample it as a goal. This makes the approach powerful, since targeting a goal will often allow an agent to simultaneously learn about other goals. IMGEPs can be implemented with population-based policy learning approaches (Baranes & Oudeyer, 2010; Péré et al., 2018) or using goal-parameterized deep reinforcement learning techniques (Colas et al., 2018a).

Until recently, IMGEPs were limited to engineered goal spaces. This approach limits the autonomy of the agent, and in many interesting problems such a goal space is not provided and may be hard to design manually. It is always possible to use the sensory space as a goal space. However, in many cases the sensory space is high-dimensional, e.g. made of low-level perceptual measures such as pixels, and building a goal-parameterized reward directly in this space is problematic. Thus, it was proposed in Péré et al. (2018) to leverage representation learning algorithms such as Variational Auto-Encoders (VAEs) and to use the learned latent space as the goal space. It was further shown in Laversanne-Finot et al. (2018) that if the representation used as a goal space is disentangled (e.g. encoding different physical properties of the environment separately), then it becomes possible to achieve more efficient exploration in environments with multiple objects and distractors, through a modular goal exploration algorithm that samples goals which maximize the learning progress.

However, all these experiments were performed in simulated environments. Furthermore, they assumed the availability of many observations of outcomes produced by another agent, covering the diversity of possible outcomes, in order to initially train the goal space representation.

In this paper, we provide evidence that the ideas developed in those papers can also be successfully applied to real-world scenarios. We also show how they can be transposed in a sample-efficient manner to a fully autonomous learning setting where the representation learning mechanism is trained on outcome data gathered autonomously by the agent. In particular, we consider an experiment where a 6-joint robotic arm interacts with a ball inside a closed arena, and we show that using a learned representation as a goal space leads to a better exploration of the environment than a strong baseline consisting of randomly sampling dynamic motion primitives.

2 Goal exploration with learned goal spaces

This section briefly introduces Intrinsically Motivated Goal Exploration Processes, using a learned representation of the goal space. The overall architecture is summarized in Figure 1. For a more thorough introduction to IMGEPs with engineered goal spaces and learned goal spaces we refer to Forestier et al. (2017) and Laversanne-Finot et al. (2018), respectively.

In order to understand the general idea of IMGEPs, one must imagine the agent as performing a sequence of contextualized and parameterized experiments. At the beginning of each experiment the agent will in sequence: observe the context, sample a goal according to some strategy, use its internal knowledge (policy) to find the best motor parameters to achieve this goal in this context, and then perform the experiment using these parameters. The goals are arbitrary and can range from “moving the ball to this specific position” to “moving the end effector of the arm to this location”, when the goal space is hand-crafted. If this is not the case one strategy proposed in Péré et al. (2018) is to learn a representation of the environment, using data sampled from demonstrations, and to use the latent space as the goal space. In this case a goal is a point in the latent space, and one uses a similarity function in this space as the associated goal achievement reward function. The agent then tries to produce an outcome that, when encoded, is as close as possible to this point in the latent space. See Algorithmic Architecture 1 for a high level algorithmic description of IMGEPs and Appendix 6.1 for more details on the different components.
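The experiment loop described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: the environment `toy_environment`, the identity encoder `encode`, the uniform goal-sampling box, and the Gaussian perturbation of the retrieved parameters are all hypothetical stand-ins, and the context is omitted for brevity.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def toy_environment(theta):
    """Hypothetical environment: 2 motor parameters -> 2-D observation."""
    return (theta[0] * theta[1], theta[0] - theta[1])


def encode(observation):
    """Hypothetical stand-in for the learned encoder (identity here)."""
    return tuple(observation)


def run_imgep(n_bootstrap=20, n_explore=200, noise=0.05, seed=0):
    rng = random.Random(seed)
    history = []  # tuples (theta, encoded outcome)
    for iteration in range(n_bootstrap + n_explore):
        if iteration < n_bootstrap:
            # Bootstrap phase: random motor parameters.
            theta = [rng.uniform(-1.0, 1.0) for _ in range(2)]
        else:
            # Sample a goal in the latent space (assumed: uniform in a box).
            goal = (rng.uniform(-1.0, 1.0), rng.uniform(-2.0, 2.0))
            # Policy: reuse the parameters whose encoded outcome was closest
            # to the goal, then perturb them to keep exploring.
            best_theta, _ = min(history, key=lambda rec: sq_dist(rec[1], goal))
            theta = [min(1.0, max(-1.0, p + rng.gauss(0.0, noise)))
                     for p in best_theta]
        outcome = encode(toy_environment(theta))
        # Every outcome is stored, whatever goal was targeted (hindsight).
        history.append((tuple(theta), outcome))
    return history
```

Calling `run_imgep()` returns the full history of (parameters, encoded outcome) pairs, which is what an exploration measure would then be computed on.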

3 Experiments

We carried out experiments on a real world environment to address the following questions:

• To what extent can the ideas developed in simulated environments be applied to a real-world setup?

• Does the dataset used to train the representation algorithm need to contain examples of all possible outcomes in order to learn a goal space that gives good exploration performance? Can the goal space be learned during exploration, as examples of outcomes are collected?

In order to answer those questions, we experimented on a robotic setup that is similar in spirit to the environments considered in the simulated experiments, and that we now describe in detail:

Robotic environment

The environment is composed of a 6-joint robotic arm that moves in an arena. In this arena, a (tennis) ball can be moved around. Due to the geometry of the arena, the ball is more or less constrained to move along a circle. A picture of the environment is shown in Figure 2. The agent perceives the scene as a $64\times 64$-pixel image. The motion of the arm is controlled by Dynamical Movement Primitives (DMPs). Actions are the parameters of the DMPs used in the current episode. There is one DMP per joint. Each DMP is parametrized by one weight for each of its $7$ basis functions and one supplementary weight specifying the end joint state, for a total of $6\times(7+1)=48$ parameters.
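To make the parameter count concrete, the following sketch splits a flat action vector into per-joint DMP parameters. The ordering of the weights inside the flat vector (joint by joint, basis weights first, end state last) and the function name `split_action` are assumptions made for illustration; the text only fixes the counts.

```python
N_JOINTS = 6  # one DMP per joint
N_BASIS = 7   # basis-function weights per DMP


def split_action(action):
    """Split a flat action into per-joint (basis weights, end joint state)."""
    per_joint = N_BASIS + 1  # 7 basis weights + 1 end-state weight
    if len(action) != N_JOINTS * per_joint:
        raise ValueError("expected %d parameters" % (N_JOINTS * per_joint))
    joints = []
    for j in range(N_JOINTS):
        chunk = action[j * per_joint:(j + 1) * per_joint]
        joints.append((list(chunk[:N_BASIS]), chunk[N_BASIS]))
    return joints
```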

For the representation learning phase, we considered different strategies. In the first strategy (as was done in Péré et al. (2018) and Laversanne-Finot et al. (2018)), we consider that the agent has access to a database of examples of the possible set of outcomes. From this database, the agent learns a representation that is then used as a goal space for the exploration phase. This strategy is referred to as RGE (VAE). One could argue that this method introduces knowledge about the set of possible outcomes that can be obtained by the agent. In order to test how this impacts the performance of the exploration algorithms, we also experimented with a representation learned using only the samples collected during the initial iterations of random motor exploration. We refer to this strategy as RGE (Online).

Baselines

Results obtained using IMGEPs with learned goal spaces are compared to two baselines:

• Random Parameter Exploration (RPE), where exploration is performed by uniformly sampling parameters $\theta$. This strategy is inefficient, as it does not leverage information collected during previous rollouts to choose the current parameters. It serves as a lower bound for the performance of the exploration algorithms. Since DMPs were designed to enable the production of a diversity of arm trajectories with only a few parameters, this lower bound is already a reasonable baseline that performs better than applying random joint torques at each time-step of the episode.

• Goal Exploration with Engineered Features Representation (RGE-EFR): an IMGEP in which the goal space is handcrafted and corresponds (as closely as possible) to the true degrees of freedom of the environment. In this experiment it is not clear which representation is best, as multiple choices are possible (e.g. Cartesian or polar coordinates for the position of the ball). We settled on polar coordinates, as the ball moves along a circle. Since essentially all the information is available to the agent in a highly semantic form, this baseline is expected to give an upper bound on the performance of the exploration algorithms.

4 Results

To assess the performance of the IMGEPs with learned goal spaces, we performed between 8 and 14 trials for each configuration. In order to speed up the learning procedure, for each configuration using a learned goal space, we used the same representation for all trials. (Footnote 1: We did not pick a particular representation, and preliminary experiments show that similar performance is obtained with other representations.)

Exploration performances

The performance of the algorithm is defined as the number of ball positions reached during the experiments. In this configuration, the ball is the hard part of the exploration problem, since the end position of the robotic arm can be efficiently explored by performing random motor commands. In practice, the performance of the exploration algorithms is measured by discretizing the outcome space into 900 cells (30 cells per dimension) and counting the number of different cells reached by the ball during the experiment. The number of cells that can be reached is limited by the finite size of the arm and the arena.
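The cell-counting measure can be sketched as follows. The bounds of the outcome space (here $[-1,1]$ in each dimension) and the function name `count_reached_cells` are assumptions made for illustration; only the 30-by-30 discretization comes from the text.

```python
def count_reached_cells(outcomes, low=-1.0, high=1.0, bins=30):
    """Count distinct grid cells visited by 2-D outcomes (ball positions)."""
    cells = set()
    for x, y in outcomes:
        if not (low <= x <= high and low <= y <= high):
            continue  # ignore positions outside the assumed bounds
        # Map each coordinate to a cell index; the upper edge falls in the last cell.
        i = min(int((x - low) / (high - low) * bins), bins - 1)
        j = min(int((y - low) / (high - low) * bins), bins - 1)
        cells.add((i, j))
    return len(cells)
```

With this definition, two nearby ball positions falling in the same cell count once, so the measure rewards spreading outcomes over the space rather than collecting many samples in one region.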

The exploration performance is reported in Figure 3. From the plot, it is clear that IMGEPs with both learned and engineered goal spaces perform better than the RPE strategy. When using a representation learned before exploration (RGE (VAE)), the performance is at least as good as exploration using the engineered representation. When the goal space is learned using the online strategy, there is an initial phase where the exploration performance is the same as that of RPE. However, after this initial collection phase, when the exploration strategy is switched from random parameter exploration to goal exploration using the learned goal space (at $2000$ exploration episodes), there is a clear change in the slope of the curve in favor of the goal exploration algorithm. (Footnote 2: Note that the first $2000$ exploration episodes for the online strategy are the same for all runs performed on the same platform. In practice, this part of the curve should be similar to the RPE curve, which was obtained with many more trials.)

All in all, the differences in performances between IMGEPs and random parameter exploration are less pronounced than in past simulated experiments. We hypothesize that this is due to the ball being too simple to move around. Thus, the random parameter exploration, which leverages DMPs to produce diverse arm trajectories, achieves decent exploration results. Also the motors of the robotic arm are far from being as precise as in simulation, which makes it harder to learn a good inverse model for the policy and to output parameters that will move the ball.

5 Conclusion

In this paper we studied how learned representations can be used as goal spaces for exploration algorithms. We have shown in a real world experiment that using a representation as a goal space provides better exploration performances than a naive exploration of the space of motor commands.

One of the main advantages of using a learned goal space is that it alleviates the need to engineer a representation, which is not a simple task in general. For example, in the robotic setup it is not clear that the engineered representation used is the most convenient one for the exploration algorithm. In this case, the position of the ball is parametrized using polar coordinates. In this representation, two points that have the same distance to the center and have angles $0$ and $2\pi$ are perceived as very distant, even though physically they correspond to the same outcome. Also, the position of the ball is extracted using a handcrafted algorithm. This algorithm may fail (e.g. when the ball is hidden by the robotic arm), in which case it may report wrong values to the policy. Such problems make learning an inverse model harder and thus reduce the exploration performance. Using a learned representation, on the other hand, obviates those problems.
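The wrap-around issue with polar coordinates can be illustrated numerically. In this small sketch (the function names are hypothetical), two ball positions on the same circle, just on either side of the angle-zero axis, are far apart when the distance is computed naively on $(r,\varphi)$ but almost identical in Cartesian coordinates.

```python
import math


def naive_polar_distance(p, q):
    """Euclidean distance computed directly on (radius, angle) pairs."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def cartesian_distance(p, q):
    """Distance between the same two points after conversion to (x, y)."""
    x1, y1 = p[0] * math.cos(p[1]), p[0] * math.sin(p[1])
    x2, y2 = q[0] * math.cos(q[1]), q[0] * math.sin(q[1])
    return math.hypot(x1 - x2, y1 - y2)


a = (1.0, 0.01)                 # just above the angle-zero axis
b = (1.0, 2 * math.pi - 0.01)   # just below it, physically adjacent to a
```

Here `naive_polar_distance(a, b)` is close to $2\pi$, while `cartesian_distance(a, b)` is about $0.02$, so a nearest-neighbor lookup in the polar goal space would treat the two outcomes as unrelated.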

As mentioned above, it is possible to imagine more involved goal selection schemes when the representation is disentangled (see Section 6.6 for a short description of the results described in Laversanne-Finot et al. (2018)). These goal selection schemes leverage the disentanglement of the representation to provide better exploration performance. We tested these ideas in this experiment and did not find any advantage in using those goal selection schemes. This is not surprising, since there are no distractors in this experiment and modular goal exploration processes are specifically designed to handle distractors. Consequently, designing a real-world experiment with distractors, in order to test modular goal exploration processes with learned goal spaces, would be of great interest for future work.

Acknowledgments

We would like to thank Sébastien Forestier for help in setting up the experiment. Simulated experiments presented in this paper were carried out using the PlaFRIM experimental testbed, supported by Inria, CNRS (LABRI and IMB), Université de Bordeaux, Bordeaux INP and Conseil Régional d’Aquitaine (see https://www.plafrim.fr/).

6 Appendices

6.1 Intrinsically Motivated Goal Exploration Processes

In this part, we give further explanations on Intrinsically Motivated Goal Exploration Processes.

Meta-Policy Mechanism

The (Meta-)Policy is responsible for outputting the actions/parameters that are used during the episode. Given a context $c$ and a goal $\tau$, the Policy should output the parameters $\theta$ that are the most likely to produce an observation $o$ that fulfills the task $\tau$. How well an observation $o$ fulfills a task $\tau$ can be quantified by a cost function $C:\mathcal{T}\times\mathcal{O}\mapsto\mathbb{R}$.

There are two different ways to construct a meta-policy, both of which are depicted in Figure 4:

• Direct-Model Meta-Policy: In this case, an approximate phenomenon dynamic model $\tilde{D}$ is learned using a regressor (e.g. LWR). The model is then updated regularly by performing a training step with the newly acquired data. At execution time, for a given goal $\tau$, a loss function is defined over the parameterization space through $L(\theta)=C(\tau,\tilde{D}(\theta,c))$. A black-box optimization algorithm, such as L-BFGS, is then used to optimize this function and find the optimal set of parameters $\theta$ (see Baranes & Oudeyer (2013), Forestier & Oudeyer (2016) and Benureau & Oudeyer (2016) for examples of such meta-policy implementations in the IMGEP framework).

• Inverse-Model Meta-Policy: In this approach, an inverse model $\tilde{I}:\mathcal{T}\times\mathcal{C}\mapsto\Theta$ is learned from the history $\mathcal{H}$, which contains all the previous experiments in the form of tuples $(c_{i},\theta_{i},o_{i})$. To learn the inverse model, it is necessary to associate a task $\tau_{i}$ with every observation $o_{i}$. The inverse model can then be learned using usual regression techniques from the set $\{(\tau_{i},c_{i},\theta_{i})\}$.

In our case, we took the approach of using an Inverse-Model based Meta-Policy. We draw the attention of the reader to the following implementation details:

• It may happen that different parameters lead to the same final outcome. For example, different movements of the arm can put the ball and the arm in the same final position. However, in general, a combination of parameters that each lead to the same outcome does not itself produce a similar outcome. This is often referred to as the redundancy problem in robotics or as a multi-modality issue (Pathak et al., 2018). To tackle this issue, we used a $\kappa$-nn regressor with $\kappa=1$.

• In order to associate a goal with each of the observations, we used the (either learned or engineered) embedding function. To the observation $o_{i}$ corresponds the goal $\tau_{i}$ defined through $\tau_{i}:=R(o_{i})$.

Our particular implementation of the Meta-Policy is outlined in Algorithm 2. The Meta-Policy is instantiated with one database per goal module. Each database stores the representations of the observations projected on its associated subspace, together with the associated contexts and parameterizations. Given that the meta-policy is implemented with a nearest neighbor regressor, training the meta-policy simply amounts to updating all the databases. Note that, as stated above, even though at each step the goal is sampled in only one module, the observation obtained after an exploration iteration is used to update all databases.
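A nearest-neighbor meta-policy of this kind can be sketched as below. This is an illustrative reading of the description above, not a transcription of Algorithm 2: the class name `NearestNeighborMetaPolicy`, the representation of modules as tuples of latent indices, and the use of a plain Euclidean distance over the concatenated goal and context are assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class NearestNeighborMetaPolicy:
    def __init__(self, modules):
        # modules: dict mapping a module name to the latent indices it covers
        self.modules = modules
        self.databases = {name: [] for name in modules}

    def update(self, context, theta, latent):
        """Store one experiment in every database, whatever module was targeted."""
        for name, indices in self.modules.items():
            projected = tuple(latent[i] for i in indices)
            self.databases[name].append((projected, tuple(context), tuple(theta)))

    def infer(self, module, goal, context):
        """Return the stored parameters whose (goal, context) is closest (1-nn)."""
        database = self.databases[module]
        if not database:
            raise ValueError("no experiment stored yet")
        query = tuple(goal) + tuple(context)
        best = min(database, key=lambda rec: sq_dist(rec[0] + rec[1], query))
        return best[2]
```

Because the regressor returns stored parameters as they are, it never averages two parameter vectors that reach the same outcome by different movements, which is the motivation given above for choosing $\kappa=1$.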

6.2 Deep Representation Learning Algorithms

In this section we summarize the theoretical arguments behind Variational Auto-Encoders (VAEs).

Variational Auto-Encoders (VAEs)

Let $\mathbf{x}\in\mathcal{X}$ denote an observation. If we assume that the observed data are realizations of a random variable, we can hypothesize that they are conditioned by a random vector of independent factors $\mathbf{z}$, i.e. that $p(\mathbf{x},\mathbf{z})=p(\mathbf{z})p_{\theta}(\mathbf{x}\mid\mathbf{z})$, where $p(\mathbf{z})$ is a prior distribution over $\mathbf{z}$ and $p_{\theta}(\mathbf{x}\mid\mathbf{z})$ is a conditional distribution. In this setting, given an i.i.d. dataset $\mathcal{D}=\{\mathbf{x}^{1},\ldots,\mathbf{x}^{N}\}$, learning the model amounts to searching for the parameters $\theta$ that maximize the dataset log-likelihood:

$$\log\mathcal{L}(\mathcal{D})=\sum_{i=1}^{N}\log p_{\theta}(\mathbf{x}^{i})$$

In practice, maximizing this likelihood directly is often computationally intractable, so models are trained to maximize what is often referred to as the Evidence Lower Bound (ELBO):

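$$\mathcal{L}(\mathbf{x};\theta,\phi)=\mathbb{E}_{\mathbf{z}\sim q_{\phi}(\mathbf{z}\mid\mathbf{x})}\left[\log p_{\theta}(\mathbf{x}\mid\mathbf{z})\right]-\mathbb{D}_{KL}\left[q_{\phi}(\mathbf{z}\mid\mathbf{x})\,\|\,p(\mathbf{z})\right],$$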

where $\mathbb{D}_{KL}$ is the Kullback-Leibler divergence, by jointly optimizing over the parameters $\theta$ and $\phi$ (often those of neural networks).
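As a worked instance of the two ELBO terms, the sketch below assumes a diagonal Gaussian posterior parametrized by a mean and a log standard deviation, a standard normal prior, and a Bernoulli likelihood over pixels (the choices reported in Section 6.3). The expectation is approximated with a single sample of $\mathbf{z}$, which is an assumption of this sketch; the function names are hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_sigma):
    """KL[N(mu, sigma^2) || N(0, 1)], summed over independent latent dimensions."""
    return sum(0.5 * (m * m + math.exp(2.0 * ls) - 1.0) - ls
               for m, ls in zip(mu, log_sigma))


def bernoulli_log_likelihood(x, probs, eps=1e-7):
    """log p(x | z) for pixels x in [0, 1], given decoder output probabilities."""
    total = 0.0
    for xi, pi in zip(x, probs):
        pi = min(max(pi, eps), 1.0 - eps)  # avoid log(0)
        total += xi * math.log(pi) + (1.0 - xi) * math.log(1.0 - pi)
    return total


def elbo_single_sample(x, decoder_probs, mu, log_sigma):
    """One-sample ELBO estimate; decoder_probs is the decoder output for a sampled z."""
    return bernoulli_log_likelihood(x, decoder_probs) - kl_to_standard_normal(mu, log_sigma)
```

When the posterior equals the prior (zero mean, zero log standard deviation), the KL term vanishes and the estimate reduces to the reconstruction log-likelihood.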

6.3 Details of Neural Architectures and training

Model Architecture

The encoder for the VAEs consisted of 4 convolutional layers, each with 32 channels, 4x4 kernels, and a stride of 2. This was followed by 2 fully connected layers, each of 256 units. The latent distribution consisted of one fully connected layer of 20 units parametrizing the mean and log standard deviation of 10 Gaussian random variables. The decoder architecture was the transpose of the encoder, with the output parametrizing Bernoulli distributions over the pixels. ReLU activation functions were used. This architecture is based on the one proposed in Higgins et al. (2016).
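The spatial sizes produced by such an encoder can be checked with the usual convolution output-size formula. The padding is not stated in the text; a padding of 1 is assumed here purely for illustration, since it halves the resolution exactly at each layer.

```python
def conv_output_size(size, kernel=4, stride=2, padding=1):
    """Output width/height of a convolution (padding=1 is an assumption)."""
    return (size + 2 * padding - kernel) // stride + 1


sizes = [64]  # input images are 64x64
for _ in range(4):  # 4 convolutional layers
    sizes.append(conv_output_size(sizes[-1]))

# Number of features fed to the first fully connected layer: 32 channels.
flattened = 32 * sizes[-1] * sizes[-1]
```

Under this assumption the feature maps shrink as 64, 32, 16, 8, 4, giving 512 features before the fully connected layers; a different padding would change these numbers.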

Training details

The optimizer used was Adam (Kingma & Ba, 2015).

For the simulated experiment, we used a learning rate of $5\times 10^{-5}$ and a batch size of $64$. The overall training of the representation took 1M training iterations.

For the robotic experiment, we used a learning rate of $1\times 10^{-5}$ and a batch size of $64$, and trained the network for 300k iterations when the representation was learned before the exploration. When the representation was learned from the outcomes obtained by random exploration, we used a batch size of $32$ and the same learning rate, and trained the network for 200k iterations.

6.4 Scatter plots Robotic environment

Scatter plots of the exploration for the different exploration algorithms, together with the number of cells reached, are shown in Figure 5. Although the exploration of the outcome space of the arm is similar for all algorithms, there is a qualitative difference between RPE and all instantiations of IMGEPs in the outcomes obtained in the outcome space of the ball.

6.5 Experimental setup

In practice, experiments are performed in parallel using multiple copies of the same experiment. A picture of the complete experimental setup is shown in Figure 6. Only the 6-joint robotic arm in the center of the arena is used in the experiments presented in this paper. The cameras capturing the images are located on the bar above the setup.

6.6 Modular Goal Exploration Processes

In this section we recap some of the results presented in Laversanne-Finot et al. (2018).

6.6.1 IMGEPs with modular goal spaces

As mentioned in the main text, when the environment is more complex, and in particular when it contains distractors (objects that cannot be controlled), it is possible to design more efficient exploration algorithms. Modular goal exploration algorithms are designed to allow the agent to separate the exploration of different objects. For example, the agent could decide to set itself goals either for the ball or for its arm. The general idea is that some goals are harder (if not impossible) to reach than others. By monitoring its ability to fulfill different kinds of goals, the agent is able to discover autonomously the difficulty of each type of goal and to focus its exploration on goals that are neither too easy nor too hard. Using this strategy, the agent thus autonomously designs a curriculum. See Algorithmic Architecture 3 for the corresponding algorithmic architecture.

When the goal space is engineered, the different modules can be readily defined when designing the goal space. However, in the case of learned goal spaces there is no easy solution. The strategy proposed in Laversanne-Finot et al. (2018) is to form modules by grouping some of the latent variables together. The goals of one module are then to reach observations for which the latent variables corresponding to this module have specific values. If the representation of the world is disentangled, different latent variables encode different degrees of freedom of the environment. In that case, each module corresponds to the distinct object encoded by its latent variables. By monitoring its progress in controlling each of the latent variables, the agent will discover that latent variables that encode distractors cannot be controlled, while latent variables encoding other objects can be controlled. The agent will thus be able to focus its exploration on controllable latent variables, leading to better exploration performance.
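One simple way to turn this monitoring into a module-selection rule is sketched below. It is an illustrative scheme and not necessarily the one used in Laversanne-Finot et al. (2018): learning progress is taken as the absolute change in mean goal-reaching error between the older and the newer half of a recent window, and modules are chosen with probability proportional to that progress, mixed with a small uniform term `epsilon`. All names and constants are assumptions.

```python
def learning_progress(errors, window=4):
    """Absolute change in mean error between the two halves of the last window."""
    recent = errors[-window:]
    if len(recent) < window:
        return 0.0  # not enough data to estimate progress
    half = window // 2
    older, newer = recent[:half], recent[half:]
    return abs(sum(older) / len(older) - sum(newer) / len(newer))


def module_probabilities(error_histories, epsilon=0.1, window=4):
    """Probability of picking each module as the next goal module."""
    progress = {name: learning_progress(errors, window)
                for name, errors in error_histories.items()}
    total = sum(progress.values())
    n = len(progress)
    if total == 0.0:
        return {name: 1.0 / n for name in progress}
    return {name: epsilon / n + (1.0 - epsilon) * value / total
            for name, value in progress.items()}
```

A module whose error never changes, such as one encoding a distractor, receives only the uniform share `epsilon / n`, so most goals are sampled in modules where the agent is still improving.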

6.6.2 Results on Arm-2-Balls

The ideas of modular IMGEPs were tested in the Arm-2-Balls environment that is described below.

Arm-2-Balls The environment consists of a rotating 7-joint robotic arm that evolves in a scene containing two balls of different sizes, as represented in Figure 7. One ball can be grasped and moved around in the scene by the robotic arm. The other ball acts as a distractor: it cannot be grasped nor moved by the robotic arm but follows a random walk. The agent perceives the scene as a $64\times 64$ pixels image.

For the representation learning phase we used a Variational Auto-Encoder (VAE) for the entangled representation and a $\beta$ -VAE for the disentangled representation. $\beta$ -VAE are a variant of VAEs that have been argued to have better disentanglement properties (Higgins et al., 2016; 2017b; 2017a). To train the representation, we generated a dataset of images for which the positions of the two balls were uniformly distributed over $[-1,1]^{4}$ . This dataset was then used to learn a representation using a VAE or a $\beta$ -VAE. In order to test the impact of the disentanglement on the performances of the exploration algorithms, we used the same disentangled/entangled representation for all the instantiations of the exploration algorithms. This allowed us to study the effect of disentangled representations by eliminating the variance due to the inherent difficulty of learning such representations.

6.6.3 Scatter plots Arm-2-Balls environment

Examples of exploration curves obtained with all the exploration algorithms discussed in this paper are shown in Figure 9 (algorithms with engineered features representation) and Figure 10 (algorithms with learned goal spaces). It is clear that the random parameterization exploration algorithm fails to produce a wide variety of observations. Although the random goal exploration algorithms perform much better than the random parameterization algorithm, they tend to produce observations that are clustered in a small region of the space. On the other hand, the observations obtained with modular goal exploration algorithms are scattered over all the accessible space, with the exception of the case where the goal space is entangled (VAE).

Bibliography

1. Baranes & Oudeyer (2010). Adrien Baranes and Pierre-Yves Oudeyer. Intrinsically motivated goal exploration for active motor learning in robots: A case study. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1766–1773. IEEE, 2010.
2. Baranes & Oudeyer (2013). Adrien Baranes and Pierre-Yves Oudeyer. Active learning of inverse models with intrinsically motivated goal exploration in robots. Robotics and Autonomous Systems, 61(1):49–73, 2013. ISSN 09218890. doi: 10.1016/j.robot.2012.05.008.
3. Benureau & Oudeyer (2016). Fabien C. Y. Benureau and Pierre-Yves Oudeyer. Behavioral Diversity Generation in Autonomous Exploration through Reuse of Past Experience. Frontiers in Robotics and AI, 3(March), 2016. ISSN 2296-9144. doi: 10.3389/frobt.2016.00008.
4. Colas et al. (2018a). Cédric Colas, Pierre Fournier, Olivier Sigaud, Mohamed Chetouani, and Pierre-Yves Oudeyer. Curious: Intrinsically motivated multi-task, multi-goal reinforcement learning. arXiv preprint arXiv:1810.06284, 2018a.
5. Colas et al. (2018b). Cédric Colas, Olivier Sigaud, and Pierre-Yves Oudeyer. GEP-PG: Decoupling exploration and exploitation in deep reinforcement learning. In International Conference on Machine Learning (ICML), 2018b.
6. Conti et al. (2017). Edoardo Conti, Vashisht Madhavan, Felipe Petroski Such, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. arXiv preprint arXiv:1712.06560, 2017.
7. Da Silva et al. (2014). Bruno Da Silva, George Konidaris, and Andrew Barto. Active learning of parameterized skills. In International Conference on Machine Learning, pp. 1737–1745, 2014.
8. Forestier & Oudeyer (2016). Sébastien Forestier and Pierre-Yves Oudeyer. Modular active curiosity-driven discovery of tool use. IEEE International Conference on Intelligent Robots and Systems, pp. 3965–3972, 2016. doi: 10.1109/IROS.2016.7759584.