Hypothesis-Driven Skill Discovery for Hierarchical Deep Reinforcement Learning
Caleb Chuck, Supawit Chockchowwat, Scott Niekum

TL;DR
This paper introduces HyPE, a hierarchical skill learning algorithm that improves sample efficiency in deep reinforcement learning by discovering objects and testing hypotheses about their controllability from raw pixel data.
Contribution
The paper presents HyPE, a novel hypothesis-driven approach that enhances exploration and skill discovery in DRL through object-based hypotheses and hierarchical learning.
Findings
HyPE significantly outperforms state-of-the-art methods in sample efficiency.
HyPE successfully discovers objects and controllability hypotheses from raw pixel data.
HyPE achieves high scores faster in both robotic and game environments.
| Random | Gripper | Block | Reward |
|---|---|---|---|

| Base | HyPE | Rainbow | A2C & PPO |
|---|---|---|---|
| 52,000 | 55,500 | | |
Caleb Chuck, Supawit Chockchowwat and Scott Niekum. The University of Texas at Austin, Personal Robotics and Automation Lab. Contact: [email protected]
Abstract
Deep reinforcement learning (DRL) is capable of learning high-performing policies on a variety of complex high-dimensional tasks, ranging from video games to robotic manipulation. However, standard DRL methods often suffer from poor sample efficiency, partially because they aim to be entirely problem-agnostic. In this work, we introduce a novel approach to exploration and hierarchical skill learning that derives its sample efficiency from intuitive assumptions it makes about the behavior of objects both in the physical world and simulations which mimic physics. Specifically, we propose the Hypothesis Proposal and Evaluation (HyPE) algorithm, which discovers objects from raw pixel data, generates hypotheses about the controllability of observed changes in object state, and learns a hierarchy of skills to test these hypotheses. We demonstrate that HyPE can dramatically improve the sample efficiency of policy learning in two different domains: a simulated robotic block-pushing domain, and a popular benchmark task: Breakout. In these domains, HyPE learns high-scoring policies an order of magnitude faster than several state-of-the-art reinforcement learning methods.
I Introduction
While recent advances in deep reinforcement learning (DRL) have been used to obtain exciting results on a variety of high-dimensional visual tasks, these algorithms often require large amounts of data in order to achieve good performance. In real-world domains such as robotics, this data is difficult and expensive to collect in sufficient quantity. When using high dimensional observations like images as input state, a natural factorization of state often exists that reduces the state space complexity. This factorization reduces the space of pixels to a space where only sparse instances of interaction occur [1]—for example, a key unlocking a door, a bat hitting a baseball, or gripper-object contact in robotic manipulation. Such interactions often create state-space bottlenecks [2]—a small subset of states which must be reached in order for the agent to access large regions of the state space. This feature of the state space can make exploration difficult, but can also present opportunities for hierarchical RL algorithms [3, 4, 5, 6] to learn options [7] that efficiently navigate between regions separated by bottlenecks. Unfortunately, hierarchical RL also often learns slowly because extrinsic reward must be experienced, often repeatedly, before the agent can start to learn meaningful behavior or skills. For sufficiently difficult sparse-reward tasks, even with state-of-the-art exploration methods [8, 9, 10], the agent may take an exceptionally long time to see a positive reward even once.
Thus, rather than working backwards from reward [11, 12], we propose building forward towards state bottlenecks by learning skills that control sparse interactions between objects in physics-based domains. These skills can then be used to navigate bottlenecks, explore efficiently, and maximize return. In this work objects are physical entities that obey certain laws. Additionally, we add abstract objects such as primitive actions or reward that can affect, or be affected by, physical objects. In this work, if an action (either a primitive action or a higher-level option that controls an object) can cause a predictable change in another object, we refer to that action as being causal of that change. The Hypothesis Proposal and Evaluation (HyPE) algorithm illustrated in Figure 1 exploits causal relationships between objects using a three step loop to learn policies which navigate state space bottlenecks:
1. Object discovery: In this step, we aim to discover factorized object representations from raw pixel inputs. Specifically, in each iteration of the HyPE loop, we attempt to discover one new object (the target object) by learning a set of convolutional filters whose outputs meet certain physics-guided criteria, and whose motion is explained by a previously discovered object or primitive action (the source object).
2. Skill proposal via hypothesis generation: We then generate one or more hypotheses about the specific changes that the source object can cause in the target object by observing interactions between them (i.e. how the state of the source object appears to influence the state of the target object). However, these hypothesized interactions may be spurious rather than causal, so each hypothesis must be made testable. To do so, each hypothesis is instantiated as an option whose goal is to cause a particular change in the target object via the source object. The actions available to the option are, in turn, previously learned options that control the source object, beginning with primitive actions at the base of the hierarchy.
3. Hierarchical skill learning via hypothesis evaluation: Finally, each interaction hypothesis is tested by using DRL to determine if the associated option is learnable. If an option policy can be successfully learned, the corresponding hypothesis is confirmed and the option hierarchy is extended permanently.
The loop then continues from step 1. For example, in Breakout, options for controlling the ball can be learned by using options that control paddle displacement, which, in turn, use primitive actions. The next iteration of the HyPE loop could then use the new ball-control options to learn options which control the blocks. Figure 2 shows an illustration of such a hierarchy.
We evaluate HyPE in two domains: First, a simulated robotic pushing domain in which standard DRL methods exhibit poor sample efficiency. Second, the classic arcade game of Breakout, where HyPE improves the sample efficiency of policy learning on raw pixels by an order of magnitude, as compared to several state-of-the-art DRL algorithms.
II Problem Formulation
II-A MDP formulation
A Markov Decision Process (MDP) is defined by the tuple $(\mathcal{X}, \mathcal{A}, T, R, \gamma)$. At each time step $t$, the agent observes a state $x_t \in \mathcal{X}$ (in our case, $x_t$ is an image), with starting state distribution $\rho_0$, and takes primitive action $a_t \in \mathcal{A}$. The next state is determined by $T(x_{t+1} \mid x_t, a_t)$, the probability distribution over the subsequent state given the current state $x_t$ and current action $a_t$. The agent receives external reward $r_t = R(x_t, a_t)$ as a function of the current state and action. The return is the discounted sum of rewards: $\sum_{t} \gamma^{t} r_t$, where $\gamma$ is the discount factor. A policy $\pi(a \mid x)$ is defined as the probability of an action $a$ given state $x$. Reinforcement learning searches for the policy that maximizes this total expected return.
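As a small illustration of the return described above, the following Python sketch computes a discounted sum of rewards for one finite episode. The function name `discounted_return` and the example reward sequence are assumptions made for illustration and do not come from the paper.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of rewards: sum over t of gamma**t * rewards[t]."""
    total = 0.0
    # Accumulate from the last step backwards, applying one factor of gamma per step.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # Hypothetical sparse-reward episode: reward arrives only at the final step.
    print(discounted_return([0.0, 0.0, 1.0], gamma=0.5))  # 0.25
```

With a sparse reward like this one, the return is dominated by how late the single non-zero reward arrives, which is one way to see why sparse-reward tasks are hard to explore.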
II-B Hierarchical Skills via Options
Our skill hierarchy is based on the options framework [7]. An option is defined by the tuple $(I, \pi, \beta)$, where $I$ is the initiation set, $\pi$ is a policy within the option, and $\beta$ is the termination condition. We simplify the initiation set so that all options are available everywhere, i.e. $I$ is the entire state space.
An option hierarchy [4] is a sequence of sets of options in which the action space of each option in a given set consists of the options in the set below it, and the action space of the lowest set consists of the primitive actions. Thus, executing an option from the top set executes an option from the set below, which itself executes an option from the next set down, and so on, until an option from the lowest set executes a primitive action.
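The recursive descent through an option hierarchy can be sketched as follows. This is a minimal illustration under stated assumptions: the `Option` class, the `resolve` function, the state dictionary keys, and the two-level paddle/ball example are all hypothetical, and termination conditions are omitted so that only a single decision step is shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Option:
    name: str
    policy: Callable[[dict], int]        # maps a state to an index into `actions`
    actions: List[Union["Option", str]]  # lower-level options, or primitive action names


def resolve(option: Option, state: dict) -> str:
    """Follow option choices downwards until a primitive action (a string) is reached."""
    choice = option.actions[option.policy(state)]
    while isinstance(choice, Option):
        choice = choice.actions[choice.policy(state)]
    return choice


# Hypothetical two-level hierarchy: ball options act through paddle options.
PRIMITIVES = ["LEFT", "RIGHT"]
move_left = Option("paddle_left", lambda s: 0, PRIMITIVES)
move_right = Option("paddle_right", lambda s: 1, PRIMITIVES)
bounce = Option(
    "ball_bounce",
    lambda s: 0 if s["ball_x"] < s["paddle_x"] else 1,
    [move_left, move_right],
)

if __name__ == "__main__":
    print(resolve(bounce, {"paddle_x": 5, "ball_x": 2}))  # LEFT
```

A full implementation would also run each lower-level option until its termination condition fires; the sketch only shows how a high-level choice bottoms out in a primitive action.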
The HyPE algorithm learns object-specific option sets to build a hierarchy of object control, treating primitive actions as the first object. The object state of an object at a given time is defined by a mapping from raw state to object-factorized state. In this work, the object state is limited to an (x, y) position, but this can be extended to any function of the state in future work. An object option set contains a set of options, where the terminal set of each option is
[TABLE]
and where is a single-step displacement corresponding to option . The policy of this option uses options from as the action space which are recursively defined as single step displacement of . For example, in breakout use displacements as actions, and controls displacements in . We simplify to . Figure 2 describes this object option hierarchy for the Breakout domain.
III Methods
In this section we introduce the HyPE loop, which at each iteration learns an object identification function and adds an option set to an object option hierarchy, starting from the options and state space for primitive actions. The object state for primitive actions is the action taken on the current time step, and its option set is the set of primitive actions. Each iteration performs three sub-steps. 1) Object discovery finds a new target object by learning an object identification function. This function tracks the object in the scene, and is learned using correlations with a source object. 2) Hypothesis proposal generates hypotheses about how the target object can be controlled using the options controlling the source object. 3) Hypothesis evaluation uses deep reinforcement learning to learn options which produce the proposed control, with that control as the termination set of each new option. If the policy achieves non-trivial reward, HyPE adds the new option set to the hierarchy and keeps track of the object identification function. The HyPE loop then iterates again, using the extended hierarchy to discover additional new objects, and terminates when it has achieved high task reward (the object option hierarchy is task-specific). See Algorithm 1.
III-A Object Discovery
The object discovery step learns an object identification function, the mapping from images to object state, for the target object. In order to learn this function, we use the following insight: we can discover new objects by tracking interactions with a source object learned in a previous iteration of the HyPE loop. In practice, we represent the object identification function as a convolutional neural network (CNN) which outputs a heatmap over the input image, and returns the pixel coordinate with the highest intensity response.
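The last step of the object identification function, picking the pixel with the strongest response, can be sketched as follows. This is a minimal illustration in plain Python: the CNN itself is omitted, the heatmap is assumed to be given as a nested list, and the name `argmax_2d` is hypothetical.

```python
def argmax_2d(heatmap):
    """Return the (row, col) of the largest value in a 2-D list of numbers.

    Ties are broken in favour of the first maximum in row-major order.
    """
    best_value = None
    best_position = (0, 0)
    for row_index, row in enumerate(heatmap):
        for col_index, value in enumerate(row):
            if best_value is None or value > best_value:
                best_value = value
                best_position = (row_index, col_index)
    return best_position


if __name__ == "__main__":
    # Hypothetical 2x3 response map standing in for the CNN output.
    print(argmax_2d([[0.1, 0.2, 0.0], [0.9, 0.3, 0.4]]))  # (1, 0)
```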
For the object discovery step we optimize a loss function over a sequence of object states of source object and the target object (being learned) where
[TABLE]
The first criterion measures the relevance of correlated interactions between the target object state and the source object state. A correlated interaction occurs when a changepoint, a time at which the motion of the target object changes significantly, falls within an eligible time, a time at which the two objects are likely to be interacting. We define these terms formally below. The second term penalizes the l2 distance between sequential outputs of the object identification function. The loss is optimized using a black-box optimization algorithm, CMA-ES, over the weights of the CNN.
To define eligible, we use spatial proximity as a physical heuristic for the ability of objects to interact. Since our state is defined by coordinates of the objects, two objects are eligible when
[TABLE]
The hyperparameter in Equation 3 specifies a pixel distance threshold. The appropriate value depends on the geometry of the objects; although ideally we would define the distance in terms of edges or nearest-point analysis, for simplicity we use point distance in this paper. We define a time step as an eligible time if the source and target object states at that time step satisfy Equation 3.
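The proximity heuristic for eligibility can be sketched as a simple distance test. This is an illustration under stated assumptions, not the paper's exact Equation 3: Euclidean point distance and a strict inequality are assumed, and the function name `eligible_times` and the example coordinates are hypothetical.

```python
import math


def eligible_times(source_states, target_states, threshold):
    """Binary vector with 1 where the two objects are closer than `threshold` pixels.

    Assumes Euclidean distance between (x, y) positions and a strict inequality.
    """
    return [
        1 if math.dist(source, target) < threshold else 0
        for source, target in zip(source_states, target_states)
    ]


if __name__ == "__main__":
    source = [(0, 0), (0, 0), (0, 0)]
    target = [(3, 4), (10, 0), (1, 1)]
    print(eligible_times(source, target, threshold=6))  # [1, 0, 1]
```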
We detect changepoints using Changepoint Detection using Approximate Model Parameters (CHAMP) [13]. Changepoints are timesteps in a trajectory. Each pair of sequential changepoints defines a segment, within which a model approximates the state transition. We use the affine model class, parameterized by a matrix D and a length-2 vector learned by linear regression. The number of changepoints is discovered from data. With abstract objects like primitive actions, where proximity does not make sense as a criterion for eligibility, we use changepoints in the object as eligible times.
When an eligible time co-occurs with a changepoint, we call this a correlated interaction. The F1 score is the harmonic mean of precision and recall between eligible times and changepoints. The intuition is that for the source object state to be correlated with the target object state, there should be more changepoints than usual during eligible times; otherwise the changepoints can be seen to be independent of the source. Using the F1 score to quantify correlated interactions balances eligible times with changepoints. Without balancing, the CNN might learn to track the source object (always eligible) or constantly exhibit difficult-to-model displacements (many changepoints).
Given a trajectory, we can define binary vectors, with one entry per time step, for eligible times e and changepoints c, which are 1 at an eligible time/changepoint respectively, and 0 elsewhere. The F1 score for correlated interactions is
[TABLE]
This is just one possible instantiation of eligibility and changepoints—future work could use heuristics or learned metrics.
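The F1 score between eligible times and changepoints can be sketched directly from the textual description (harmonic mean of precision and recall over two binary vectors). This follows the standard F1 definition rather than reproducing the paper's exact equation; the function name `f1_correlated` and the choice of which vector plays the role of the "prediction" are assumptions, although F1 is symmetric in that choice.

```python
def f1_correlated(eligible, changepoints):
    """F1 score between two equal-length binary vectors.

    A correlated interaction is a time step where both vectors are 1.
    """
    hits = sum(e * c for e, c in zip(eligible, changepoints))
    if hits == 0:
        return 0.0  # also covers the case where either vector is all zeros
    precision = hits / sum(changepoints)  # changepoints that fall on eligible times
    recall = hits / sum(eligible)         # eligible times that contain a changepoint
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    e = [1, 1, 0, 0, 1]
    c = [1, 0, 0, 1, 1]
    print(f1_correlated(e, c))
```

An always-eligible vector paired with rare changepoints gets high recall but low precision, which is the degenerate case of tracking the source object itself that the text describes.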
The score achieved by the F1 component of the loss provides a measure used to verify that object discovery has learned an object that is likely to be controllable through the source object. If the final F1 score after learning is low, this means that the learned object identification function does not map to a feature that has many correlated interactions with the source object, and might be noise. This is a good criterion for deciding when object discovery should move to a different source object. In Breakout, for example, the abstract object for primitive actions has a high F1 score when locating the paddle, because changing actions is highly correlated with changes in paddle motion. After the paddle is learned, however, the F1 score for primitive actions in the scene is low. This stopping criterion is:
[TABLE]
Though the vision system is sufficient to achieve the results we describe in Section IV, we acknowledge some shortcomings. First, the loss is defined pairwise between only two objects, meaning that objects which have multiple simultaneous interactions are difficult to identify. Second, it is necessary to remove already learned objects from the input image, or they might be learned repeatedly. We do this by subtracting the mean of a fixed region around an already learned object from the image, which masks out the learned object. Finally, the vision loss uses a single image as input into a CNN of limited size (due to the computational cost of running CMA-ES, even for moderately sized neural networks). This will be addressed in future work by taking in multiple frames as input and designing a differentiable definition of eligibility and changepoints.
III-B Hypothesis Proposal
Object discovery determines that the target object has correlated interactions with the source object, but does not specify how to learn options to manipulate the target object. Hypothesis generation constructs the set of possible motions, the hypotheses, that the source object can effect on the target object. Each proposed motion is a hypothesis about the way that the source object can causally manipulate the target object. Learning these options in hypothesis evaluation would verify that the source object causes the desired motion in the target object.
Hypothesis proposal uses distinct motions to define the termination set for each option. A motion is represented as a single-timestep change in the target object state. For a state followed by its successor state, the termination set corresponding to a hypothesized motion has the form
[TABLE]
To define the motions, we want to specify as few motions as possible while capturing all single-timestep motions of the target object caused by the source object. Limiting the number of motions reduces the cost of policy learning in hypothesis evaluation. Thus, even though the set of motions could be all observed single-timestep displacements after a correlated interaction, this set would include many spurious changepoints due to multi-object interactions and vision failures. Instead, to reduce noise, we take these displacements as the training set for a DP-GMM (Dirichlet process Gaussian mixture model) [14], an unsupervised clustering model, and keep the clusters whose number of assigned data points is above a fixed minimum. The means of these clusters are used as the hypothesized motions. The remaining termination set parameters are computed using the variance of the Gaussian corresponding to the respective cluster. This unsupervised method for discretizing and denoising object changes is one choice of method for defining the set of hypothesized control.
For the hypotheses that the source object causes a change in the target object, we not only want to propose a set of possible motions, but also to ensure that this set of motions is caused by the source object. To do this, we specify the input space and (output) action space of the policy which will be learned to fulfill the termination condition. The input state of the target options is the state of the source object and the target object only. For example, a paddle-ball policy ignores block state. The action space is the option set that manipulates the source object. This ensures that the option manipulates the target object via the source object, and is blind to objects other than the two it is learning the interaction between.
In summary, hypothesis proposal generates a set of possible options, each distinguished by a particular displacement motion that it effects on the target object, using the source object's options as actions and ignoring all state except that of the source and target objects.
III-C Hypothesis testing to determine causal links
Hypothesis testing differentiates correlation between two objects, as observed in the previous two steps, and causal relations, by attempting to learn the policies corresponding to the proposed option set using DRL.
Using DRL requires a reward function. We convert the termination condition of the option into an intrinsic reward for training the policy by:
[TABLE]
E, C, and B are 0-1 indicator functions of eligibility, changepoints, and termination set membership, respectively (Equation 3, Section III-A, Equation 6). This gives nonzero reward only for correlated interactions which produce the desired single-timestep displacement.
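The structure of this intrinsic reward, a product of three 0-1 indicators, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's exact equations: the reward magnitude `value`, the Euclidean-ball form of the termination test, and the names `in_termination_set` and `intrinsic_reward` are all hypothetical.

```python
import math


def in_termination_set(displacement, motion, tolerance):
    """Assumed termination test: the observed single-step displacement lies
    within `tolerance` (Euclidean distance) of the hypothesized motion."""
    return math.dist(displacement, motion) <= tolerance


def intrinsic_reward(eligible, changepoint, terminated, value=1.0):
    """Product of three 0-1 indicators, scaled by an assumed reward magnitude."""
    return value * int(eligible) * int(changepoint) * int(terminated)


if __name__ == "__main__":
    # Hypothetical step: objects are close, a changepoint occurred, and the
    # target moved roughly as hypothesized.
    hit = in_termination_set((0.0, -2.1), (0.0, -2.0), tolerance=0.5)
    print(intrinsic_reward(True, True, hit))    # 1.0
    print(intrinsic_reward(True, False, hit))   # 0.0
```

Because the indicators are multiplied, a changepoint that happens away from the source object, or a displacement that does not match the hypothesis, yields no reward.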
We use DRL with this reward function and test a variety of different DRL algorithms, including actor critic (PPO) [15], Actor (policy iteration), Critic (Q-learning) [16] and black box (CMA-ES) [17].
In order to exploit the object factorized structure of , we utilize a neural net which computes:
[TABLE]
where the inputs are basic features computed from the source and target object states, such as relative position or velocity, while still including the two object states themselves. This network expands each input feature into a fixed-length embedding using a matrix of weights (all input features have dimension 2) followed by the rectified linear unit. It then takes the mean of all embedding vectors and feeds this forward to action logits, using a softmax operation to convert these to probabilities.
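The forward pass described above (per-feature embedding, ReLU, mean over embeddings, action logits, softmax) can be sketched in plain Python. This is a minimal sketch under stated assumptions: a single embedding matrix shared across features, no bias terms, and one linear output layer are assumed, and all function names and weights are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, x) for x in vector]


def softmax(logits):
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_probs(features, w_embed, w_out):
    """features: list of 2-D vectors; w_embed: k x 2; w_out: n_actions x k."""
    embeddings = [relu(matvec(w_embed, f)) for f in features]
    k = len(w_embed)
    mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(k)]
    return softmax(matvec(w_out, mean))


if __name__ == "__main__":
    # Hypothetical features: source position, target position, relative position.
    features = [[4.0, 1.0], [6.0, 3.0], [2.0, 2.0]]
    w_embed = [[0.1, -0.2], [0.0, 0.3], [-0.1, 0.1]]
    w_out = [[0.5, 0.0, 0.2], [-0.3, 0.4, 0.1]]
    print(policy_probs(features, w_embed, w_out))
```

Averaging the embeddings makes the output independent of the order in which the input features are listed.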
To train the options corresponding to each hypothesized motion simultaneously, we randomly switch between executing each option for a fixed duration, and perform off-policy updates when the learning algorithm permits. When using an on-policy DRL algorithm, we update on-policy instead.
Hypothesis evaluation assesses the causal relation between the source and target objects by comparing the expected return of the learned policy (computed using the intrinsic reward in Equation 7) with the expected return of a random policy.
[TABLE]
A policy which satisfies the above criterion produces the hypothesized state change more often than a random policy. This is a usable option for manipulating the target object, so we add it to the target object's option set.
Since the policy uses the source object's option set to manipulate the target object, and has a limited input space including only the source and target object states, this policy can be seen as an intervention where the source object causally controls the target object. In addition, learning at least one option demonstrates some control over the target object by the agent, since the base node of the object option chain is primitive actions.
III-D Overall HyPE Loop
The HyPE algorithm applied to a new domain, as described in Algorithm 1, repeatedly loops between object discovery, hypothesis proposal and hypothesis evaluation. In order to begin object discovery, historical data is initialized by collecting data from a policy which takes random actions. This sample also forms the baseline comparison for Equation 9.
The only option set in the hierarchy on the first iteration of the HyPE loop corresponds to primitive actions. Its object state records the actions taken, and each option corresponds to a primitive action. Object discovery optimizes Equation 2 using trajectories from the historical data as training data. The learned object identification function returns an object state with correlated interactions with the primitive actions. Then, hypothesis proposal and evaluation learn the option set to manipulate this object, and add it to the hierarchy.
The subsequent iterations of the loop add new option sets to the chain from the bottom up. The loop starts object discovery with a candidate source object. If optimizing Equation 2 fails to learn an object identification function that satisfies Equation 5, or hypothesis evaluation fails (Equation 9), object discovery restarts with a different source object. This traversal heuristic assumes that objects more directly manipulated will be easier to learn about. In the case of multiple chains (i.e. multiple objects are directly manipulated by the same source object), HyPE finds the shortest chain to manipulate reward by discovering objects in a breadth-first style search.
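The control flow of the loop described in this section can be sketched as a small skeleton. This is not a reproduction of Algorithm 1: the four callables are hypothetical stand-ins for the learned components, the toy functions only mimic the Breakout example from the text, the primitive action names are invented, and trying source objects in the order they were learned is an assumption about the traversal heuristic.

```python
def hype_loop(discover, propose, evaluate, solved, max_iters=10):
    """Skeleton of the discovery -> proposal -> evaluation loop."""
    # Each entry is (object name, option set); primitive actions come first.
    hierarchy = [("actions", ["noop", "left", "right"])]
    for _ in range(max_iters):
        if solved(hierarchy):
            break
        extended = False
        # Assumption: try source objects in the order they were learned.
        for source, source_options in list(hierarchy):
            target = discover(source, hierarchy)
            if target is None:  # discovery failed, move to another source
                continue
            confirmed = [h for h in propose(source, target)
                         if evaluate(h, source_options)]
            if confirmed:  # at least one hypothesis was learnable
                hierarchy.append((target, confirmed))
                extended = True
                break
        if not extended:
            break
    return hierarchy


# Toy stand-ins loosely modeled on the Breakout example in the text.
def toy_discover(source, hierarchy):
    known = {name for name, _ in hierarchy}
    target = {"actions": "paddle", "paddle": "ball"}.get(source)
    return target if target not in known else None


def toy_propose(source, target):
    return [f"{target}:move_left", f"{target}:move_right"]


def toy_evaluate(hypothesis, source_options):
    return True  # pretend every hypothesized option is learnable


def toy_solved(hierarchy):
    return any(name == "ball" for name, _ in hierarchy)


if __name__ == "__main__":
    for name, options in hype_loop(toy_discover, toy_propose, toy_evaluate, toy_solved):
        print(name, options)
```

In the toy run the hierarchy grows from primitive actions to the paddle and then to the ball, at which point the stand-in for task reward is satisfied and the loop stops.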
IV Results
We demonstrate the capabilities of the HyPE algorithm in two domains. First, we show HyPE learns policies which achieve good performance in a perfect-perception robotic pushing domain, where classical reinforcement learning methods perform poorly. In the robotic domain, common DRL baselines struggle to learn fairly intuitive policies because of the state space bottlenecks in pseudo-physical domains. We also show HyPE learns high-scoring policies from pixels in a sample-efficient manner in the classic game (and DRL benchmark task) Breakout, where standard DRL policies take many more timesteps.
IV-A Robotic Pushing Domain
In our 2-D robotic pushing domain, a gripper, controlled in cardinal directions, manipulates a block by pushing it into a target location. The agent receives non-zero reward of 1 only if the block contacts a target area. Episodes end when the block contacts the target area, or after 300 timesteps. All three objects have randomly initialized positions. This domain is challenging for standard RL because the reward is extremely sparse: a random policy takes on average time steps to stumble upon non-zero reward.
In the pushing domain, we seek to demonstrate that HyPE provides clear benefits beyond factorized state. Thus, the robot pushing domain has perfect perception. Incorporating perception with the full HyPE loop is the focus of the Breakout experiments. The state consists of three pairs of pixel coordinates corresponding to the gripper, block, and target. HyPE iterations in the robotic pusher domain start with hypothesis proposal to learn relations between the paired positions. Hypothesis evaluation learns options that perform the desired behavior, utilizing Proximal Policy Optimization (PPO) [15] with the 0-1 reward and with (as defined in Section III.C).
The HyPE loop begins with only the options and state for primitive actions, and initializes the dataset with random actions until it picks up a correlation between primitive actions and another state variable, which takes random steps. HyPE proposes five motions over the gripper: left, right, up, down, and stationary, with pixels as cluster means of the learned DP-GMM. Hypothesis evaluation learns policies corresponding to these options in time steps.
The data from this step allows hypothesis proposal about block control. However, because block changepoints occur infrequently, the initial options only capture limited control of the block, without rewarding all directions. After time steps, however, the options are re-specified with left, right, up, down motions, four . HyPE learns these policies in an average of time steps of training, leading to policies which push the block in the desired direction.
The third iteration of the HyPE loop reveals that the relationship between the block and target is correlated with extrinsic reward (and end of the episode). HyPE proposes options to control the extrinsic reward by controlling the block. Since this is a multi-object interaction, where a changepoint in extrinsic reward depends on the relative positions of block and target, we augmented HyPE hypothesis proposal by allowing proximity between the block and target (or any other pair of objects) to be correlated with changepoints in the reward. The policy for controlling extrinsic reward learns in an average of time steps. Figure 3 shows the full performance of HyPE in the pushing domain. Learning the reward option set requires fewer time steps than the block option set, because HyPE uses control of the block as the action space for this option, and only needs to plan using the block-target relative information.
By comparison, A2C [16], Proximal Policy Optimization (PPO) and Rainbow [18] trained on the same domain, using the baseline reward, return policies that do not perform better than random even after time steps. Even when given a shaped reward, which is equal to , a scaled negative l1 norm shaped reward between the block and the target, these standard RL algorithms fail to learn meaningful policies. We constructed one baseline which succeeded in pushing the block to the goal approximately of the time after time steps (as compared to the success rate from HyPE). The specialized HyPE-like reward gave reward for moving the block, and a reward for end of episode. This demonstrates not only that HyPE learns a very reasonable set of sub-tasks, but also that the action hierarchy of HyPE, the main difference between this baseline and HyPE, is invaluable in solving some tasks. These results overall demonstrate how the object option chain from HyPE can be used to solve problems which would be infeasible for standard RL. Table I shows the timesteps needed for learning to control the different objects.
IV-B Breakout Domain
We add object discovery to the HyPE loop in Breakout (Figure 1). As before, starting from only the options and state for primitive actions, HyPE takes random actions until it discovers an object correlated with primitive actions. After frames of random data, HyPE learns an object detection CNN to locate the paddle by optimizing the loss defined in Equation 2. With sufficient F1 score to pass Equation 5, the loop proposes and learns to control the paddle or [math] pixels in 2.5k timesteps. The HyPE loop adds the paddle's object identification function and option set.
Using the cumulative data, the HyPE loop then optimizes the F1 score starting with as the source object. With the paddle removed from the image, the F1 score does not pass Equation 5. However, using , object discovery learns a CNN which tracks the ball. Due to the rarity of ball bounces, the proposed option only bounces the ball off the paddle—it groups all angles together, leaving a single ball control option. Learning this option is sufficient to control extrinsic reward, so HyPE terminates.
In Figure 4, we show that HyPE has an order of magnitude improvement in sample efficiency when compared to standard RL methods on Breakout. HyPE, at train time, achieves average train reward per episode of in 55k frames, while Rainbow [18] takes 400k timesteps, and Proximal Policy Optimization [15] and A2C [16] take roughly 1.4M timesteps to achieve the same performance. However, because CMA-ES, which is used to learn the HyPE policies in Breakout, typically has substantially higher test than train performance, comparing training performance understates the performance of the learned policy. The evaluation policy learned by HyPE after 55k frames achieves average reward per episode performance. Rainbow, the best performing baseline, takes timesteps to achieve the same performance.
Note that though the HyPE loop learns intuitive objects (the paddle, ball), this is not encoded explicitly anywhere in the algorithm but emerges from physical priors and controllability. HyPE has the same information as the standard RL algorithms, but uses object priors to achieve high sample efficiency.
V Related Work
This work combines ideas from causality and relational learning, model-based reinforcement learning, and hierarchical reinforcement learning. It uses these ideas to construct a hierarchical reinforcement learning problem with intrinsic rewards.
Causal graphs: The hypothesis proposal and evaluation ideas draw from causality literature [20]. These components of HyPE relate to work where causal graphs [21] and graph-dependent policies [22] are learned from interactions with the environment. HyPE also learns causal object relationships [23, 24, 25, 26], which has similarities to object-oriented and relational reinforcement learning [27, 28, 29, 30]. Though HyPE uses a similar object-oriented relational structure, it learns object-object interactions one at a time for highly efficient hierarchical reinforcement learning and visual factorization.
Model-based RL: Model-based reinforcement learning using a learned model has shown substantial improvements in reinforcement learning sample efficiency in Atari games [31]. These methods often learn to predict future raw state [32] or latent space [33] in tandem with learning a useful policy [34]. They then incorporate planning [35, 36, 37] with constructed environment models [38]. HyPE makes loose use of modeling to generate different options, but once it learns control policies, it applies model-free reinforcement learning and should be improved by incorporating model-based RL.
Hierarchical RL and Exploration: Learning hierarchies of control with options has been studied in detail [39, 3], and can be used to define a system for learning skills and state spaces [4, 40, 41, 42, 43, 44]. Exploration work has used novelty [45, 46, 47], frontier states [48, 49], model prediction error [50, 9], sub-goals from hindsight [51], bottleneck regions [3] or contingency [52, 53] as other exploration objectives. While the HyPE loop is inspired by these works, it incorporates object option hierarchies and physical priors.
VI Conclusion
We introduced the HyPE algorithm, which explores physically inspired state space bottlenecks to efficiently learn to hierarchically explore and control its environment. By taking advantage of some physically inspired priors like proximity and changepoints, this system learns high performing policies in RL settings. Though this system is less application agnostic than classic general-purpose RL algorithms, it achieves sample efficiency that is an order of magnitude better than standard RL methods on multiple domains. Future work can address the practical issues required to extend HyPE to physical real-world domains. Furthermore, the object option chain structure generated by HyPE may have implications for both explainable AI and transfer learning.
References
[1] O. Kroemer, S. Niekum, and G. Konidaris, “A review of robot learning for manipulation: Challenges, representations, and algorithms,” arXiv preprint arXiv:1907.03146, 2019.
[2] Ö. Şimşek and A. G. Barto, “Skill characterization based on betweenness,” in Advances in Neural Information Processing Systems, pp. 1497–1504, 2009.
[3] P.-L. Bacon, J. Harb, and D. Precup, “The option-critic architecture,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[4] G. Konidaris, “Constructing abstraction hierarchies using a skill-symbol loop,” in IJCAI: proceedings of the conference, vol. 2016, p. 1648, NIH Public Access, 2016.
[5] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu, “Feudal networks for hierarchical reinforcement learning,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3540–3549, JMLR.org, 2017.
[6] A. Sharma, S. Gu, S. Levine, V. Kumar, and K. Hausman, “Dynamics-aware unsupervised skill discovery,” in Proceeding of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, pp. 26–30, 2020.
[7] R. S. Sutton, D. Precup, and S. Singh, “Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning,” Artificial Intelligence, vol. 112, no. 1-2, pp. 181–211, 1999.
[8] G. Ostrovski, M. G. Bellemare, A. van den Oord, and R. Munos, “Count-based exploration with neural density models,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2721–2730, JMLR.org, 2017.
