Towards Learning to Imitate from a Single Video Demonstration
Glen Berseth, Florian Golemo, Christopher Pal

TL;DR
This paper presents a method for training reinforcement learning agents to imitate behaviors from a single video demonstration without access to explicit state or action data, using contrastive learning and a Siamese network.
Contribution
The authors introduce a contrastive training approach with a Siamese network to learn reward functions from a single video, improving imitation learning in RL agents.
Findings
Outperforms state-of-the-art in simulated humanoid, dog, and raptor environments.
Multi-task data and image encoding losses enhance reward consistency and policy learning.
Successfully learns imitation from a single video demonstration in diverse agents and environments.
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Reinforcement Learning in Robotics
Glen Berseth, Université de Montréal, Mila Quebec AI Institute, and Canada CIFAR AI Chair
Florian Golemo, Mila Quebec AI Institute
Christopher Pal, Polytechnique Montréal, Mila Quebec AI Institute, ServiceNow Research, and Canada CIFAR AI Chair
Abstract
Agents that can learn to imitate behaviours observed in video – without having direct access to internal state or action information of the observed agent – are more suitable for learning in the natural world. However, formulating a reinforcement learning (RL) agent that facilitates this goal remains a significant challenge. We approach this challenge using contrastive training to learn a reward function by comparing an agent’s behaviour with a single demonstration. We use a Siamese recurrent neural network architecture to learn rewards in space and time between motion clips while training an RL policy to minimize this distance. Through experimentation, we also find that the inclusion of multi-task data and additional image encoding losses improve the temporal consistency of the learned rewards and, as a result, significantly improve policy learning. We demonstrate our approach on simulated humanoid, dog, and raptor agents in 2D and quadruped and humanoid agents in 3D. We show that our method outperforms current state-of-the-art techniques and can learn to imitate behaviours from a single video demonstration.
Keywords: Reinforcement Learning, Deep Learning, Imitation Learning
1 Introduction
Imitation learning gives an agent the ability to reproduce the behaviours and skills of other agents through demonstrations (Blakemore and Decety, 2001). These demonstrations act as a type of explicit communication, informing the agent of the desired behaviour. However, real-world agents such as robots do not generally have access to the type of information needed by many imitation learning methods, such as internal state information or the executed actions of a demonstration. We also want a solution that can learn to imitate even if the observed agent has a different appearance or different dynamics from the agent that is tasked with learning from the demonstration. In the same way that human children can learn to imitate adults by observing them, we want more versatile agents that can learn to imitate desired behaviours given only a few examples. This type of visual imitation is depicted in Figure 1 with an agent that learns and minimizes the distance between visual demonstrations. If we can construct agents that learn to imitate from this kind of easy-to-obtain but noisy data, they will be much more flexible for learning via imitation in the real world.
While imitating a behaviour from observation is natural to many agents in the real world, it poses many learning challenges. Behaviour cloning (BC) methods often require expert action data, making them impossible to use in more natural problem settings where only image observations are available. Also, in the real world, the demonstration agent often has different dynamics, meaning that an exact copy of the demonstration is impossible, and the learning agent must do its best to replicate the observed behaviour under its own dynamics. How can we train an agent to reliably imitate another agent with potentially different dynamics, given only image data from the demonstration agent? A well-formed and smooth learned distance between observed behaviours can be used as the reward for a reinforcement learning agent. However, given the partially observed nature of the demonstration data, learning a reasonable distance function is challenging. To compensate for limited and noisy data, a method that can incorporate additional offline data, potentially from other behaviours/tasks, will increase data efficiency while training a model that understands the comparative landscape of a larger behaviour space.
To realize the data-efficient and task-independent method described above, we train a recurrent Siamese comparator to capture the partial information from each image and use this comparator to compute distances used for rewards for an RL agent. We train this comparator model with off-policy data to learn distances between sequences of images (videos). This offline training makes it possible to pretrain and include data from additional behaviours/tasks to increase model robustness. Auto-encoding losses are added, shown in Fig. 2(a), at different levels of granularity to increase the smoothness of the learned distance landscape. Our model learns two latent distance predictors in parallel. These two latent distance predictors are shown in Fig. 2(b) and allow us to compute distances between individual images and between sequences. The image-to-image latent distance rewards the agent for precisely matching the example behaviour. In contrast, the learned sequence-to-sequence latent representation provides additional reward when the agent is not just matching the desired behaviour exactly but also when the agent is replicating portions of the observed demonstration, for example, if the agent currently has different timing than the demonstration. Our results show that the robustness and smoothness gained from the combination of losses improve training efficiency and final policy quality.
Our contribution consists of a new visual-imitation learning method for RL based on visual comparisons and the specific architectures and training procedures discussed in more detail throughout the paper. We showcase that our approach enables agents to learn a large variety of challenging behaviours that include walking, running and jumping. We perform experiments for multiple simulated robots in both 2D and 3D, including simulations for training quadruped robots and a humanoid with degrees of freedom (DoF). For many of these imitation tasks, our method “visual-imitation with reinforcement learning” (VIRL) is able to imitate the given skill using only a single observed demonstration.
2 Related Work
We group the most relevant prior work based on the type and quantity of data needed to perform imitation learning. The first group consists of GAIL (Ho and Ermon, 2016) and related methods, which require access to expert policies, states, and actions, as well as large quantities of expert data. In the second group, the need for expert action data is relaxed in methods like generative adversarial imitation from observation (GAIfO) (Torabi et al., 2018b), but these methods still require an expert policy to repeatedly generate data. The third group avoids the need for ground truth states in favour of images, which are easier to obtain (Brown et al., 2019, 2020). These methods still require many examples of data from a policy trained on an agent in the same simulation with the same dynamics. Lastly, in a fourth group, the need for multiple demonstrations and matching dynamics is relaxed, as in methods such as time-contrastive network (TCN) (Sermanet et al., 2018) and ours.
Imitation learning.
Methods such as generative adversarial imitation learning (GAIL) (Ho and Ermon, 2016) take the generative adversarial network (GAN) (Goodfellow et al., 2014) framework and apply it in the context of learning an RL policy. In GAIL, the GAN's discriminator is trained with positive examples from expert trajectories and negative examples from the current policy. However, using a discriminator is only one possible way of measuring the probability of the agent's behaviour matching the expert (Abbeel and Ng, 2004; Argall et al., 2009; Finn et al., 2017a; Brown et al., 2019; Nachum and Yang, 2021). Given a vector of features, distance-based imitation learning aims to find an optimal transformation that computes a more meaningful distance between expert demonstrations and agent trajectories. Previous work has explored the area of state-based distance functions, but most rely on the availability of an expert policy to continuously sample data (Ho and Ermon, 2016; Merel et al., 2017). In the sections that follow, we demonstrate how VIRL learns a more stable distance-based reward over sequences of images (as opposed to states) and without access to actions or expert policies.
Imitation without action data.
For learning from demonstrations (LfD) problems, the goal is to replicate the behaviour of an expert. GAIfO (Torabi et al., 2018b) has been proposed as an extension of GAIL that does not require data on the expert actions. However, GAIfO and other recent works in this area require access to an expert policy for sampling additional states during training (Sun et al., 2019; Yang et al., 2019). By comparison, our method can work with a single fixed demonstration example with different dynamics. Other recent work uses behavioural cloning (BC) to learn an inverse dynamics model to estimate the actions used via maximum-likelihood estimation (Torabi et al., 2018a). Still, BC often needs many expert examples and tends to suffer from state distribution mismatch issues between the expert policy and student (Ross et al., 2011a).
Additional works learn implicit models of distance that require large amounts of demonstration data, and none of these explicitly learns a sequential model considering the demonstration timing (Yu et al., 2018; Finn et al., 2017b; Sermanet et al., 2018; Merel et al., 2017; Edwards et al., 2019; Sharma et al., 2019). The work in Wang et al. (2017); Li et al. (2017); Peng et al. (2019, 2021) includes a more robust GAIL framework with a new model to encode motions for few-shot imitation. However, these methods need access to an expert policy to sample additional data. In this work, we train recurrent Siamese networks (Chopra et al., 2005) to learn more meaningful distances between videos. Other work uses state-only demonstrations to outperform the demonstration data but requires many demonstrations and ranking information to be successful (Brown et al., 2019, 2020). We show results on more complex 3D tasks and additionally model distance in time: because the entire sequence is embedded, our model can compute meaningful distances between agent and demonstration even if they deviate in time.
Imitation from images.
Some works, like Finn et al. (2017b); Sermanet et al. (2018); Liu et al. (2018); Dwibedi et al. (2018), use image-based inputs instead of states but require on the order of hundreds of demonstrations. Further, these models only address spatial alignment, matching joint positions/orientations to a single state, and cannot implicitly provide an additional signal related to temporal ordering between expert demonstration and agent motion like our recurrent sequence model does. Other works that imitate image-based information do so only between goal states (Pathak et al., 2018).
Imitation from few images with different dynamics.
Comparative methods like TCN use metric learning to embed simultaneous viewpoints of the same object (Sermanet et al., 2018). They use TCN embeddings as features in the system state, which are provided to the PILQR (Chebotar et al., 2017) reinforcement learning algorithm, which combines model-based learning, linear time-varying dynamics and model-free corrections. In contrast, our Siamese network-based approach learns the reward for an arbitrary subsequent RL algorithm. Our method does not rely on multiple views, and we use a recurrent neural network (RNN)-based autoencoding approach to regularize the distance computations used for generated rewards. These choices allow VIRL to achieve performance gains over TCN, as shown in Section 5.
3 Preliminaries
In this section, we provide a very brief review of the fundamental background used by our method. Reinforcement learning (RL) is formulated within the framework of a Markov decision process (MDP) where at every time step $t$, the world (including the agent) exists in a state $s_t \in S$, where the agent is able to perform actions $a_t \in A$. The action to take is determined according to a policy $\pi(a_t \mid s_t)$, which results in a new state $s_{t+1} \in S$ and reward $r_t$ according to the transition probability function $P(s_{t+1} \mid s_t, a_t)$. The policy is optimized to maximize the future discounted reward $\mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right]$, where $T$ is the max time horizon, and $\gamma$ is the discount factor. The formulation above generalizes to continuous states and actions, which is the situation for the agents we consider in our work.
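As a small illustration of the objective above, the sketch below computes the discounted sum of rewards for one finite episode. The function name `discounted_return` and the example values are illustrative assumptions, not part of the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one finite episode."""
    total = 0.0
    # Accumulate from the last step backwards so each step costs
    # one multiplication and one addition.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical three-step episode with a discount factor of 0.5.
episode_rewards = [1.0, 1.0, 1.0]
episode_value = discounted_return(episode_rewards, gamma=0.5)
```

With `gamma` close to 1 the agent weighs late rewards almost as much as early ones; smaller values make it more short-sighted.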
Imitation Learning.
Imitation learning is typically cast as the process of training a new policy to reproduce expert policy behaviour. Behaviour cloning is a fundamental method for imitation learning. Given an expert policy $\pi_E$, possibly represented as a collection of trajectories $\tau = \langle (s_0, a_0), \ldots, (s_T, a_T) \rangle$, a new policy $\pi$ can be learned to match these trajectories using supervised learning by maximizing the expectation $\mathbb{E}_{\pi_E}\left[\sum_{t=0}^{T} \log \pi(a_t \mid s_t)\right]$. While this simple method can work well, it often suffers from distribution mismatch issues leading to compounding errors as the learned policy deviates from the expert's behaviour (Ross et al., 2011b). Inverse reinforcement learning avoids this issue by extracting a reward function from observed optimal behaviour (Ng et al., 2000). In our approach, we learn a distance function that allows an agent to compare an observed behaviour to its current behaviour to define its reward at a given time step. Our method only requires a single reference activity, but the comparison network can be trained across a collection of different behaviours. Further, we do not assume the example data to be optimal. See Appendix 7.2 for further details of the connections of our work to inverse reinforcement learning.
Variational Auto-encoders
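VAEs are a popular approach for learning lower-dimensional representations of a distribution (Kingma and Welling, 2014). A VAE consists of two parts, an encoder $q_\phi$ with parameters $\phi$ and a decoder $p_\psi$ with parameters $\psi$. The encoder maps inputs x to a latent encoding z, and, in turn, the decoder transforms z back to a reconstruction $\hat{x}$. The model parameters $\phi$ and $\psi$ are trained jointly to maximize
$$\mathcal{L}_{\mathrm{VAE}}(x; \phi, \psi) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\psi(x \mid z)\right] - D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right) \qquad (1)$$
where $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence and $p(z)$ is a prior distribution over the latent space. The encoder (inference model) takes the form of a multivariate Gaussian with diagonal covariance, $q_\phi(z \mid x) = \mathcal{N}\left(\mu_\phi(x), \mathrm{diag}(\sigma^2_\phi(x))\right)$, where the mean and variance are typically given by a deep neural network.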
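To make the two terms of the VAE objective concrete, here is a minimal sketch for a single input and a single latent sample. It assumes a standard normal prior and a unit-variance Gaussian decoder, both of which are illustrative choices that the text does not specify; the function names are hypothetical.

```python
import math


def gaussian_kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def gaussian_log_likelihood(x, x_hat):
    """log N(x; x_hat, I), assuming a unit-variance Gaussian decoder."""
    squared_error = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return -0.5 * squared_error - 0.5 * len(x) * math.log(2.0 * math.pi)


def vae_objective(x, x_hat, mu, log_var):
    """Reconstruction log-likelihood minus the KL term (to be maximized)."""
    return gaussian_log_likelihood(x, x_hat) - gaussian_kl_to_standard_normal(
        mu, log_var
    )


# Hypothetical values: a perfect reconstruction with the posterior equal to
# the prior, so the KL term vanishes.
value = vae_objective([0.0, 0.0], [0.0, 0.0], mu=[0.0, 0.0], log_var=[0.0, 0.0])
```

The KL term penalizes posteriors that drift away from the prior, while the likelihood term rewards accurate reconstructions; training trades one off against the other.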
VAEs have been extended to work on sequences of images with the inclusion of an RNN, e.g. in Chung et al. (2015), and the notation in the following section follows the one from that work. Sequence-to-sequence models can be used to learn the conditional probability $p(y \mid x)$ of one sequence given another, where $x$ and $y$ are sequences. Here, we will use extensions of encoder-decoder recurrent neural networks which learn a latent representation h that compresses the information. For VIRL we reuse the VAE encoder mean $\mu_\phi$ from Equation 1 to encode individual images, and on top of these image encodings we learn the sequence encoder that produces h. Conversely, a sequence decoder, conditioned on h, reconstructs the original input sequence in two steps: first by producing a decoded sequence of latent encodings of the correct length, and second by applying the image decoder from Equation 1 to each decoded encoding to obtain the reconstruction $\hat{x}$. The loss for decoding the original sequence can then be written as the distance
$$\mathcal{L}_{\mathrm{RSAE}}(x) = \lVert x - \hat{x} \rVert \qquad (2)$$
This method works for learning compressed representations for transfer learning (Zhu et al., 2016) and 3D shape retrieval (Zhuang et al., 2015). In our case, this type of autoencoding can help regularize our model by forcing the encoding to contain all the information needed to reconstruct the trajectory.
4 Visual Imitation with Reinforcement Learning
In this work, we create a new method for performing imitation from only visual data (no actions) and use reinforcement learning to fill in the missing action data by training the agent to find the actions that result in matching the original distribution of observations.
The Sequence Encoder/Decoder Networks
Figure 2(a) shows an outline of the network design. There are two LSTMs, one for sequence encoding and one for sequence decoding, as well as one encoder CNN and one decoder CNN, all of which are shared across the agent and expert, similar to a Siamese network. A single convolutional network transforms the observation (image) at every timestep of the demonstration, from the expert or the agent, into the corresponding encoding vector. Passing the observations through the image encoder yields an encoded sequence (for the expert in this example), which is fed into the LSTM sequence encoder until a final sequence encoding is produced. The same process is performed over the agent data, producing the agent's sequence encoding. The two sequence encodings are fed into the sequence decoder separately, which generates a series of decoded latent representations that are then decoded back to images with a deconvolutional network. The same image encoder and decoder and the same sequence encoder and decoder are applied to both the agent and the expert.
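The weight sharing described above can be sketched with a toy stand-in. This is not the paper's architecture: a random linear map replaces the CNN image encoder and a plain tanh recurrent cell replaces the LSTM, and all names (`SharedEncoder`, `encode_frame`, `encode_sequence`) are hypothetical. The point is only that one set of parameters encodes both the expert and the agent sequence.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Random weight matrix as a list of rows."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


class SharedEncoder:
    """Toy frame encoder plus recurrent cell, shared by agent and expert."""

    def __init__(self, frame_dim, latent_dim, seed=0):
        rng = random.Random(seed)
        self.latent_dim = latent_dim
        self.w_frame = make_matrix(latent_dim, frame_dim, rng)  # image encoder
        self.w_in = make_matrix(latent_dim, latent_dim, rng)    # input weights
        self.w_rec = make_matrix(latent_dim, latent_dim, rng)   # recurrent weights

    def encode_frame(self, frame):
        return matvec(self.w_frame, frame)

    def encode_sequence(self, frames):
        hidden = [0.0] * self.latent_dim
        for frame in frames:
            from_input = matvec(self.w_in, self.encode_frame(frame))
            from_state = matvec(self.w_rec, hidden)
            hidden = [math.tanh(a + b) for a, b in zip(from_input, from_state)]
        return hidden  # final sequence encoding


# Hypothetical flattened 4-pixel "images" for the expert and the agent.
encoder = SharedEncoder(frame_dim=4, latent_dim=3)
expert_frames = [[0.1, 0.2, 0.3, 0.4], [0.2, 0.3, 0.4, 0.5]]
agent_frames = [[0.1, 0.2, 0.3, 0.4], [0.9, 0.1, 0.0, 0.5]]
h_expert = encoder.encode_sequence(expert_frames)
h_agent = encoder.encode_sequence(agent_frames)
sequence_distance = euclidean(h_expert, h_agent)
```

Because the same weights process both inputs, identical sequences map to identical encodings, and any distance between `h_expert` and `h_agent` reflects differences in the observations rather than in the networks.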
Loss Terms
The Siamese loss between a fully encoded demonstration sequence and a sequence of the agent forces not just individual frames but the representation of entire sequences to match if they are from the same policy. This Siamese network sequence loss is defined in Eq. 3. A frame-by-frame Siamese loss between the frames of the demonstration and those of the agent encourages individual frames to have similar encodings as well. This Siamese network image loss uses Eq. 3 as well but is trained over pairs of images. We define the Siamese network loss (both for images and sequences) as:
$$\mathcal{L}_{\mathrm{SN}}(o_i, o_p, o_n) = y \, \lVert f(o_i) - f(o_p) \rVert + (1 - y) \, \max\left(\rho - \lVert f(o_i) - f(o_n) \rVert,\; 0\right) \qquad (3)$$
where $y \in \{0, 1\}$ is the indicator for negative/positive samples. When $y = 1$, the pair is positive, and the distance between the current observation $o_i$ and the positive sample $o_p$ should be minimal. When $y = 0$, the pair is negative, and the distance between $o_i$ and the negative example $o_n$ should be increased. Subsection 4.1 outlines how the positive and negative examples are constructed. We compute this loss over batches of data that are half positive and half negative pairs. The margin $\rho$ is a fixed constant used as an attractor or anchor to pull the negative example output away from $o_i$ and to push values towards a fixed range. The function $f(\cdot)$ computes the output from the underlying network (i.e. Conv or LSTM). The data used to train the Siamese network is a combination of observation trajectories generated from simulating the agent in the environment and the demonstration. For our recurrent model the observations are sequences. We additionally train the encoding of a single observation of either agent or expert at a given timestep using the VAE loss from Eq. 1. Lastly, the entire sequence of observations of both the agent and expert is encoded and then decoded back separately, as shown in Fig. 13, and the recurrent encoder and decoder are trained with the loss from Eq. 2. We found using these image- and sequence-autoencoders important for improving the latent space conditioning. This combination of image-based and sequence-based losses assists in compressing the representation while ensuring intermediate representations remain informative. The combined loss to train the model is:
$$\mathcal{L}_{\mathrm{VIRL}} = \lambda_{\mathrm{SN}} \, \mathcal{L}_{\mathrm{SN}} + \lambda_{\mathrm{RSAE}} \, \mathcal{L}_{\mathrm{RSAE}} - \lambda_{\mathrm{VAE}} \, \mathcal{L}_{\mathrm{VAE}} \qquad (4)$$
Here $\mathcal{L}_{\mathrm{SN}}$ covers both the image and the sequence Siamese terms, the VAE objective from Eq. 1 enters with a negative sign because it is maximized, and the $\lambda$ coefficients are the relative weights of the different terms. The loss is minimized over the parameters of the image encoder convnet, the image decoder, the recurrent encoder, and the recurrent decoder. The weights $\lambda$ are found by empirically evaluating VIRL over all environments from Section 5. Additional details on the hyperparameter search can be found in subsection 7.8.
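The Siamese loss described above (pull positive pairs together, push negative pairs apart up to a margin) can be sketched as follows. This is a standard contrastive loss written over already-computed encodings; the default `margin=1.0`, the helper names, and the example weights are illustrative assumptions rather than the values used in the paper.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def siamese_loss(enc_anchor, enc_other, y, margin=1.0):
    """Contrastive loss on two encodings; y=1 is a positive pair, y=0 negative."""
    distance = euclidean(enc_anchor, enc_other)
    if y == 1:
        return distance                    # positive pair: shrink the distance
    return max(margin - distance, 0.0)     # negative pair: push beyond the margin


def batch_loss(pairs, margin=1.0):
    """Mean loss over (enc_anchor, enc_other, y) triples."""
    return sum(siamese_loss(a, b, y, margin) for a, b, y in pairs) / len(pairs)


def combined_loss(terms, weights):
    """Weighted sum of named loss terms, e.g. Siamese and autoencoding losses."""
    return sum(weights[name] * value for name, value in terms.items())


# Hypothetical batch with one positive and one negative pair.
batch = [([0.0, 0.0], [3.0, 4.0], 1), ([0.0], [0.25], 0)]
mean_loss = batch_loss(batch)
total = combined_loss(
    {"siamese": mean_loss, "autoencoding": 2.0},
    {"siamese": 0.5, "autoencoding": 0.25},
)
```

Negative pairs that are already farther apart than the margin contribute zero loss, so the gradient focuses on negatives that the encoder still confuses with the anchor.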
Reward Calculation
The model trained using the loss function described above is used to calculate the distance between the two sequences of observations (agent and demonstration) seen up to time $t$, and the reward is computed from this distance, so that a smaller distance gives a higher reward. During RL training, we compute a distance given the sequence observed so far in the episode. The sequence-based distance can model time-invariant distances, and the image-based distance can match the expert demonstration more precisely. In subsection 5.2 we experimentally show the importance of each distance for imitation learning.
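The sketch below shows one way a per-step reward could be formed from an image-based and a sequence-based distance computed on the observations seen so far. It assumes the reward is the negated weighted sum of the two distances, which is one common choice consistent with training the policy to minimize the distance. The averaging `mean_frame` encoder is a toy stand-in for the learned sequence encoder, and the weights are hypothetical.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_frame(frames):
    """Toy sequence encoder: average of the frame encodings seen so far."""
    count = len(frames)
    return [sum(f[k] for f in frames) / count for k in range(len(frames[0]))]


def rewards_over_episode(agent_frames, demo_frames, w_image=0.5, w_sequence=0.5):
    """Reward at step t uses only the observations up to and including t."""
    rewards = []
    for t in range(min(len(agent_frames), len(demo_frames))):
        # Frame-to-frame distance: rewards matching the current pose precisely.
        d_image = euclidean(agent_frames[t], demo_frames[t])
        # Sequence distance on the prefixes: rewards matching the motion so far.
        d_sequence = euclidean(
            mean_frame(agent_frames[: t + 1]), mean_frame(demo_frames[: t + 1])
        )
        rewards.append(-(w_image * d_image + w_sequence * d_sequence))
    return rewards


# Hypothetical 2-D frame encodings for a two-step episode.
demo = [[0.0, 0.0], [0.0, 0.0]]
agent = [[0.0, 0.0], [3.0, 4.0]]
episode_rewards = rewards_over_episode(agent, demo)
```

In this toy case the agent matches the demonstration at the first step (reward 0) and is penalized at the second step by both distances.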
Training the Model
Details of the algorithm used to train the distance metric and policy are in Algorithm 1. We consider a variation on the typical RL environment that produces three different outputs, two for the agent and one for the demonstration, and no reward. The first is the internal robot pose and link velocities, which we refer to as the state. The second and third are images of the agent (the observation) and of the demonstration, shown in Fig. 2(b). The images are used with the distance metric to compute the similarity between the agent and the demonstration. We train the agent's policy using the trust-region policy optimization (TRPO) algorithm (Schulman et al., 2015). The policy uses the state as input, which is easier to access than the third-person video generated by the agent during test time, and is trained online using the learned distance function. The use of off-policy training increases the LSTM-based distance function's training efficiency. The off-policy training also allows us to train the distance function using data from other tasks to increase the robustness of the model while fine-tuning on the current task, as we will describe in Section 5.2.
4.1 Unsupervised Data Labelling and Generation
To construct positive and negative pairs for training, we make use of time information and adversarial information. We use timing information where observations at similar times in the same sequence are often correlated, and observations at different times are less likely to be similar. We use these ideas to provide labels for the positive and negative pairs to train the Siamese network. Positive pairs are created by adding Gaussian noise to the images in the sequence, or by duplicating or shifting random frames of the sequences. Negative pairs are created by shuffling, cropping or reversing one sequence. VIRL will learn to decode these modified sequences, which helps the model be robust to noise. Additionally, we include adversarial pairs where positive pairs come from the same distribution, for example, two motions for the agent or two from the demonstration at different times. Negative pairs then include one from the expert and one from the agent. A combination of these augmentations is chosen randomly and applied to the current training batch; the augmented data are not added to the replay buffer. Additional details on the shuffling, swapping, and use of adversarial pairs are available in Appendix 7.7.
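The labelling scheme above can be sketched as a small pair generator. It covers only a subset of the augmentations mentioned (noise and frame duplication for positives, reversal and shuffling for negatives). The noise scale `sigma=0.05` and all function names are hypothetical, since the text does not give them.

```python
import random


def add_gaussian_noise(seq, sigma, rng):
    """Positive augmentation: perturb every pixel with Gaussian noise."""
    return [[pixel + rng.gauss(0.0, sigma) for pixel in frame] for frame in seq]


def duplicate_random_frame(seq, rng):
    """Positive augmentation: repeat one randomly chosen frame."""
    i = rng.randrange(len(seq))
    return seq[: i + 1] + [seq[i]] + seq[i + 1:]


def reverse_sequence(seq):
    """Negative augmentation: play the sequence backwards."""
    return seq[::-1]


def shuffle_sequence(seq, rng):
    """Negative augmentation: destroy the temporal order."""
    shuffled = list(seq)
    rng.shuffle(shuffled)
    return shuffled


def make_pair(seq, positive, rng, sigma=0.05):
    """Return (anchor, other, y) with y=1 for positive and y=0 for negative."""
    if positive:
        if rng.random() < 0.5:
            return seq, add_gaussian_noise(seq, sigma, rng), 1
        return seq, duplicate_random_frame(seq, rng), 1
    if rng.random() < 0.5:
        return seq, reverse_sequence(seq), 0
    return seq, shuffle_sequence(seq, rng), 0


def make_batch(seq, size, rng):
    """Batch with alternating labels, so half positive and half negative."""
    return [make_pair(seq, index % 2 == 0, rng) for index in range(size)]


rng = random.Random(0)
sequence = [[float(i)] for i in range(5)]  # hypothetical 1-pixel frames
batch = make_batch(sequence, 8, rng)
```

One caveat of this sketch is that a random shuffle can occasionally return the original order, producing a mislabelled negative; a fuller implementation would resample in that case.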
Data Augmentation
We apply several data augmentation methods to produce additional data variation for training the distance metric. Using methods analogous to the cropping and warping methods popular in computer vision (He et al., 2015), we randomly crop sequences and randomly warp the demonstration timing. The cropping is performed by removing later portions of the training sequences during batch updates. These augmentations allow the distance function to learn a more general representation of motions, such as a walk, even if the walking demonstration includes a single step or multiple steps at a different speed and pauses between steps. This cropping is denoted as early episode sequence priority (EESP), where the probability of cropping out a window of the sequence x depends on its position in the sequence, increasing the likelihood of cropping earlier in the sequence. As the agent improves, the average length of each episode increases, and so too will the average length of the cropped window. The motion warping performs a type of scaling to the time dimension of the demonstration. This is accomplished by creating a continuous function of the demonstration so we can sample the motion at different speeds. For example, if the motion is replayed at half speed, it will be twice as long as the original. Linear interpolation is used to fill in information between frames. This allows for training the distance function to recognize that, for example, a walking motion with one step in it and another with two steps are still examples of a walk. Last, we use reference state initialization (RSI) (Peng et al., 2018), where we generate the initial state of the agent and expert randomly from the demonstration. With this property, the environment functions as a form of memory replay. The environment allows the agent to go back to random points in the demonstration, as if replaying a remembered demonstration, and collect new data from that point in the demonstration. The experiments in Sec. 5.2 show the importance of these augmentation methods in terms of improving the robustness of the learned comparator and resulting policy.
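The two timing augmentations can be sketched as below. `warp_sequence` resamples a demonstration at a different playback speed using linear interpolation between neighbouring frames. `crop_with_early_priority` keeps a random prefix and assumes, purely for illustration, that shorter prefixes are linearly more likely; the exact EESP probability is not given in this text, and both function names are hypothetical.

```python
import random


def warp_sequence(frames, speed):
    """Resample frames at playback `speed` (0.5 = half speed, twice as long)."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    count = len(frames)
    if count < 2:
        return [list(frame) for frame in frames]
    last = count - 1
    new_length = int(last / speed) + 1
    warped = []
    for k in range(new_length):
        t = min(k * speed, float(last))   # source time of output frame k
        i = min(int(t), count - 2)        # index of the left neighbour
        frac = t - i                      # position between the neighbours
        warped.append(
            [a * (1.0 - frac) + b * frac for a, b in zip(frames[i], frames[i + 1])]
        )
    return warped


def crop_with_early_priority(frames, rng):
    """Keep a prefix; shorter prefixes are more likely (assumed linear decay)."""
    count = len(frames)
    lengths = list(range(1, count + 1))
    weights = [count - length + 1 for length in lengths]
    keep = rng.choices(lengths, weights=weights, k=1)[0]
    return frames[:keep]


# Hypothetical 1-pixel demonstration with three frames.
demo = [[0.0], [1.0], [2.0]]
slow = warp_sequence(demo, speed=0.5)
fast = warp_sequence(demo, speed=2.0)
cropped = crop_with_early_priority(demo, random.Random(0))
```

Replaying at half speed inserts interpolated frames between the originals, while a speed of 2 skips every other frame, so the same motion is presented to the distance model at several tempos.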
5 Experiments, Results and Analysis
We evaluate VIRL compared to previous methods in terms of sample efficiency and task-solving capability. The comparison is over a collection of different simulation environments. In these simulated robotics environments, we task the agent with imitating a given reference video demonstration. Each simulation environment provides a hard-coded reward function based on the agent’s pose to evaluate the policy quality independently. The imitation environments include challenging and dynamic tasks for humanoid, dog and raptor robots. Some example tasks are running, jumping, trotting, and walking, shown in Fig. 3 and Fig. 4. The demonstration the agent uses for visual imitation learning is produced from a clip of motion capture data for each task. The motion capture data animates a kinematically controlled robot in the simulation for capturing video. Because the demonstration is kinematically generated, the agent also needs to learn how to bridge the gap between the different dynamics in the demonstration that may be impossible to reproduce exactly. We convert the images captured from the simulation to grey-scale images. Third-person image data is often not available during test time; therefore, the agent’s policy instead receives the environment state as the link distances and velocities relative to the robot’s centre of mass (COM).
2D Imitation Tasks. The first group of evaluation environments contains a set of agents with different morphologies. In Figure 3 we show images from the 2D humanoid, dog and raptor environments. In these environments the rendering and simulation are in 2D, reducing the complexity of the control system and dynamics, and allowing for faster training times.
3D Imitation Tasks. For further evaluation, we compare the performance on a 3D humanoid and two quadrupedal robot simulators used for Sim2Real research, the Laikago (Peng et al., 2020) and Pupper (Kau et al., 2020). The humanoid3d environment has multiple tasks that can be solved and used to generate data from other tasks for training the distance metric. These tasks include walking, running, jogging, front-flips, back-flips, dancing, jumping, punching and kicking. To perform the data augmentation described in Section 4.1 we also construct data from a modified version of each task with a randomly generated playback speed modifier, e.g. walking-dynamic-speed, which warps the demonstration timing. This additional data provides a richer understanding of distances in space and time with the distance metric. As we will show later in this section, VIRL learns policies that produce similar behaviour to the demonstration across these diverse tasks. We show example trajectories from the learned policies in Fig. 4 and in the supplemental video. It takes days to train each policy in these results on a machine with an Nvidia GTX1080 GPU.
Comparison Methods
We compare VIRL to two baselines that learn distances in observation space. The first is GAIfO (Torabi et al., 2018b) that trains a GAN to differentiate between images from the demonstration and images from the agent. The other is TCN, an image-to-image only siamese model (Nair et al., 2018). These methods have been used to perform types of imitation from observation before. However, as we will see, they either require a significant amount of data to train or result in lower-quality reward functions and, as a result, lower-quality policies.
5.1 Learning Performance
In Figure 5 we present results across different agent types, including the 2D walking humanoid in Fig. 5(a), a 2D trotting dog in Fig. 5(b), and a 2D raptor in Fig. 5(c). We compare VIRL directly with both GAIfO and TCN, the strongest comparable prior method. VIRL is able to provide a denser reward signal due to the LSTM-based distance that can express similarity between motion styles. While TCN is simpler, not using an LSTM model, it does not provide as rich a reward signal for the agent to learn from. As a result, the learned reward can often be more sparse, slowing down learning. Similarly, GAIfO has difficulty learning a robust and smooth distance function with the little demonstration data available. This leads to very jerky motion or agents that stand still, matching the average pose of the demonstration, both of which are contained in the demonstration distribution but do not capture sequential temporal behaviour well.
In Figure 6 we compare VIRL with TCN across many different and more challenging humanoid3d tasks in Fig. 6(a-d). Across these experiments, we observe that VIRL learns faster and produces higher value policies. In particular, we find that VIRL does very well compared to TCN, which represents the strongest prior approach capable of performing this task of which we are aware. The humanoid3d tasks are particularly challenging as they have high control dimensionality, causing the agent to deviate from the desired imitation behaviour easily. These tasks also contain higher levels of partial observability compared to the 2D experiments. In these environments, the temporal distance in VIRL provides a crucial additional reward signal that helps the agent match the style of the motion early on in training despite the partial information in the observations.
5.2 Analysis and Ablation
Sequence Encoding Using the learned sequence encoder, we compute the encodings across a collection of different motions and create a t-distributed stochastic neighbor embedding (t-SNE) embedding of the encodings (Maaten and Hinton, 2008). In Fig. 7(a) we plot motions both generated from the learned policy and the expert trajectories . Overlaps in specific areas of the space for similar classes across learned and expert data indicate a well-formed distance metric that does not separate expert and agent examples. There is also a separation between motion classes in the data, and the cyclic nature of the walking cycle is visible.
**Ablation**
In Fig. 7(b) we compare the importance of the spatial distance, computed with the image encoder, and the temporal LSTM distance, computed with the sequence encoder of VIRL. Using the recurrent representation alone (LSTM only) allows learning to progress quickly but can make it difficult to inform the policy on how to match the desired demonstration more precisely. On the other hand, using only the encoding between single frames, as is done with TCN, slows learning because the agent receives little reward once it becomes out-of-sync with the demonstration behaviour. The best result is achieved by combining the representations from these two models (VIRL).
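As a rough illustration of the three ablation variants, the sketch below combines a per-frame (spatial) distance with a sequence-level (temporal) distance into a single reward. Plain vectors stand in for the encoder outputs, and the function names, the Euclidean distance and the equal default weights are assumptions made for illustration, not the authors' implementation.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def combined_reward(frame_agent, frame_expert, seq_agent, seq_expert,
                    w_spatial=0.5, w_temporal=0.5):
    """Negative weighted sum of spatial and temporal encoding distances.

    frame_*: hypothetical image-encoder outputs for the current frame.
    seq_*:   hypothetical sequence-encoder (LSTM) outputs for the clip so far.
    The weights are illustrative assumptions, not values from the paper.
    """
    d_spatial = euclidean(frame_agent, frame_expert)
    d_temporal = euclidean(seq_agent, seq_expert)
    return -(w_spatial * d_spatial + w_temporal * d_temporal)
```

Setting `w_temporal=0` corresponds to a frame-only variant, and `w_spatial=0` to an LSTM-only variant.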
Data augmentation comparisons. We conduct ablation studies for learning policies for 3D humanoid control in Fig. 8(a) and 8(b). We compare the effects of data augmentation methods, network models and the use of additional data from other tasks to train the Siamese network (additional tasks such as back-flips; see Appendix 7.3 for more details on these tasks). We also compare different sequence lengths for training: shorter (where the probability of the length decays linearly), uniform random and the maximum length available. For these more complex and challenging three-dimensional humanoid (humanoid3d) control problems, the data augmentation methods, including EESP, increase average policy quality only marginally compared to using multi-task data; this is likely related to the increased partial observability of these tasks. However, the addition of the auto-encoding losses to VIRL results in the quickest learning and highest value policies.
The experiment in Fig. 9(a) highlights the improvement to VIRL afforded by the recurrent sequence autoencoder (RSAE) model component from Eq. 4, which forces the encoding to contain enough information to decode the video sequence, while the experiment in Fig. 9(b) shows the dramatic improvement achieved when we include offline multi-task data for training the distance function. These methods are combined to provide substantial learning performance increases across environments for VIRL. Further analysis is available in the Appendix, including additional comparison with TCN in Fig. 15(a-b) and details on training the distance model.
Sim2Real for Quadruped Robots
We use VIRL to train policies for two simulated quadrupeds shown in Fig. 10. These environments have been used for Sim2Real transfer (Tan et al., 2018; Peng et al., 2020). The resulting behaviours learned in simulation are available at: https://sites.google.com/view/virl1. We find that the Laikago environment is particularly challenging to learn; however, we can learn good policies on the smaller Stanford Pupper in a day. This suggests that VIRL can potentially be used to learn a control policy from a single video clip and transfer that policy to real hardware; we leave the details of the transfer process to future work.
6 Discussion and Conclusion
In this work, we have created a new method for learning imitative policies from a single demonstration. The method uses a Siamese recurrent network to learn a distance function in both space and time. This distance function learns to imitate video data where the agent’s observed state can be noisy and partially observed. We use this model to provide rewards for training an RL policy. By using data from other motion styles and regularization terms, VIRL produces policies that demonstrate similar behaviour to the demonstration in complex 3D control tasks. We found that the recurrent distance learned by VIRL was particularly beneficial when imitating demonstrations with greater partial observability.
One might expect that the distance metric should be pretrained to quickly understand the difference between a good and bad demonstration. However, we have found that in this setting, learning too quickly can destabilize learning, as rewards can change, which can cause the agent to diverge off to an unrecoverable policy space. In this setting, slower is better, as the distance metric may not yet be accurate. However, the learned distance function may be locally or relatively reasonable, which is enough to learn an acceptable policy. As learning continues, these two optimizations can converge together.
When comparing our method to GAIfO, we have found that GAIfO has limited temporal consistency. GAIfO led to learning jerky and overactive policies. The use of a recurrent discriminator for GAIfO, similar to our use of sequence-based distance, may mitigate some of these issues and is left for future work. It is challenging to produce results better than the carefully manually crafted reward functions used by the RL simulation environments that include motion phase information in the observations (Peng et al., 2018, 2017). However, we have shown that our method can compute distances in space and time and learns faster than current methods that can be used in this domain. A combination of beginning learning with our method and following with a manually crafted reward function could potentially lead to faster learning of high-quality policies if true state information is available. Still, as environments become increasingly more complex and real-world training becomes an efficient option, methods that can learn to imitate from only video demonstration enable access to a vast collection of motion information, extensive studies on multi-task learning, and give way to more natural methods to provide instructions to agents.
Acknowledgements and Disclosure of Funding
We want to acknowledge funding support from NSERC, CIFAR, and ElementAI/Service Now and compute support from ComputeCanada and ElementAI/Service Now.
7 Appendix
This section includes additional details related to VIRL.
7.1 Imitation Learning
Imitation learning is the process of training a new policy to reproduce the behaviour of some expert policy. Behaviour cloning (BC) is a fundamental method for imitation learning. Given an expert policy, possibly represented as a collection of trajectories, a new policy can be learned to match these trajectories using supervised learning.
[TABLE]
While this simple method can work well, it often suffers from distribution mismatch issues leading to compounding errors as the learned policy deviates from the expert’s behaviour during test time.
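For reference, behaviour cloning is commonly written as a maximum-likelihood problem over expert state-action pairs. The notation below ($\pi_E$ for the expert policy, $\pi_\theta$ for the learned policy with parameters $\theta$) is assumed for illustration and may differ from the authors' own.

```latex
\max_{\theta} \; \mathbb{E}_{(s_t, a_t) \sim \pi_E}
  \left[ \log \pi_{\theta}(a_t \mid s_t) \right]
```

Because the expectation is taken only over states the expert visits, errors made in unseen states are never corrected during training, which is one way to see where the compounding errors come from.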
7.2 Inverse Reinforcement Learning
Similar to BC, inverse reinforcement learning (IRL) also learns to replicate some desired, potentially expert, behaviour. However, IRL uses the RL environment to learn a reward function that tells the difference between the agent’s behaviour and the example data. Here we describe maximum entropy IRL (Ziebart et al., 2008). Given an expert trajectory, a policy can be trained to produce similar trajectories by discovering a distance metric between the expert trajectory and trajectories produced by the policy.
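$$\max_{c \in \mathcal{C}} \left( \min_{\pi} \left( \mathbb{E}_{\pi}[c(s, a)] - H(\pi) \right) \right) - \mathbb{E}_{\pi_E}[c(s, a)]$$
where $c$ is a learned cost function and $H(\pi)$ is a causal entropy term. $\pi_E$ is the expert policy that is represented by a collection of trajectories. IRL is searching for a cost function $c$ that is low for the expert $\pi_E$ and high for other policies. Then, a policy can be optimized by maximizing the reward function $r_t = -c(s_t, a_t)$.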
7.3 Data
We are using the mocap data from the “CMU Graphics Lab Motion Capture Database” from 2002 (http://mocap.cs.cmu.edu/). To be thorough, we provide the processing at length. This data has been preprocessed to map the mocap markers to a human skeleton. Each recording contains the positions and orientations of the different joints of a human skeleton and can therefore directly be used to animate a simulated humanoid mesh. This is a standard approach that has been widely used in prior literature (Gleicher, 1998; Rosales and Sclaroff, 2000; Lee et al., 2002). To be precise: at each mocap frame, the joints of a humanoid mesh model are set to the positions and orientations of their respective values in the recording. If a full humanoid mesh is not available, it is possible to add capsule mesh primitives between each recorded joint. This 3D mesh model is then rendered to an image through a 3rd person camera that follows the center of mass of the mesh at a fixed distance.
For the humanoid experiments, imitation data for other tasks was used to help condition the distance metric learning process. These include motion clips for running, backflips, frontflips, dancing, punching, kicking and jumping along with the desired motion. The improvement due to these additional unsupervised training data generation mechanisms is shown in Fig. 8(a).
The Sim2Real environments do not include video demonstrations. To create video data we use a similar method as in the other simulations. The available motion capture data is used in the simulation to control a kinematic character from which 3rd person video data of that agent is collected.
7.4 Training Details
The learning simulations are trained using graphics processing units (GPUs). The simulation not only simulates the interaction physics of the world but also renders the simulation scene to capture video observations. On average, it takes days to execute a single training simulation. Rendering and copying the images from the GPU is one of the most expensive operations with VIRL. We collect samples between training rounds. The batch size for TRPO is . The KL term is .
The simulation environment includes several different tasks represented by a collection of motion capture clips to imitate. These tasks come from the tasks created in DeepMimic (Peng et al., 2018). We include all humanoid tasks in this dataset. The simulation uses reference state initialization (RSI) to randomly sample start states for the agent and expert to begin. This works by first uniformly randomly selecting a time in the expert demonstration and then synchronizing the learning agent with that time in the expert demonstration. This has been shown to be very helpful across many prior papers to boost learning and is used for all algorithms in this paper.
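The start-state sampling described above can be sketched in a few lines. This is a minimal illustration that assumes the expert demonstration is available as an indexable list of states; the function name and return shape are hypothetical.

```python
import random


def sample_reference_start(expert_states, rng=random):
    """Pick a uniformly random time index in the expert clip.

    Returns (t, state): the sampled time and the expert state at that time,
    which a simulator would then copy into the agent so that agent and
    expert begin the episode synchronised.
    """
    t = rng.randrange(len(expert_states))
    return t, expert_states[t]
```

Passing a seeded `random.Random` instance as `rng` makes the sampled start times reproducible.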
In Alg. 1 we include an outline of the algorithm used for the method and a diagram in Fig. 12. The simulation environment produces three types of observations: the agent’s proprioceptive pose, the image observation of the agent and the image-based observation of the expert demonstration. The images are grayscale . In Figure 13 we show a diagram of the network model using example data from the humanoid3d walking task. Different network structures were evaluated; this structure, with the loss defined in Equation 4, provided the best performance.
7.5 Distance Function Training
In Fig. 14(a), the learning curve for the sequence-based Siamese network is shown during a pretraining phase. We can see the overfitting portion that occurs during RL training. This overfitting can lead to poor reward prediction during the early phase of training. In Fig. 14(b), we show the training curve for the recurrent Siamese network after starting training during RL. After an initial distribution adaptation, the model learns smoothly, considering that the training data used is continually changing as the RL agent explores.
It can be challenging to train a sequence-based distance function. One particular challenge is training the distance function to be accurate across the space of possible states. We found that a good strategy was to focus on the earlier parts of the episode. When the model is not accurate on states earlier in the episode, it may never learn how to get into good states later, even if the distance function understands those better. Therefore, when constructing batches to train the RNN on, we give a higher probability of starting earlier in episodes (EESP). We also give a higher probability of shorter sequences as a function of the average episode length. As the agent improves, the average episode length increases, and so too will the length of the randomly selected sequence windows.
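One way to picture this batching strategy is the sketch below, which samples a training window from an episode so that early start indices and short lengths are more likely. The linearly decaying weights are an assumption chosen for simplicity, and the function names are hypothetical; the authors' exact sampling distribution may differ.

```python
import random


def linear_decay_index(n, rng=random):
    """Sample an index in [0, n) with linearly decreasing probability.

    Index 0 gets weight n, index 1 gets weight n - 1, ..., index n - 1
    gets weight 1.
    """
    weights = [n - i for i in range(n)]
    return rng.choices(range(n), weights=weights, k=1)[0]


def sample_window(episode_len, rng=random):
    """Return (start, length) of a training window inside one episode.

    Early starts and short lengths are favoured. The window always fits
    inside the episode: start + length <= episode_len.
    """
    start = linear_decay_index(episode_len, rng)
    max_len = episode_len - start
    length = 1 + linear_decay_index(max_len, rng)
    return start, length
```

Because the window is bounded by the episode length, longer episodes produced by a better policy automatically allow longer windows.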
We found in our experiments that keeping the same in-order sequence for decoding forced the encoding model to encode long-term temporal information from the beginning of training. This is particularly challenging to cope with as the RL policy is exploring different state distributions, further exacerbating the challenging problem of learning good temporal representations. Instead, we reverse the decoding sequence, which allows the training model to pick up on shorter-term temporal dependencies quickly. This shorter but more consistent representation provides better signal to the RL agent.
7.6 Distance Function Use
We find it helpful to normalize the distance metric outputs using where scales the filtering width. This normalization is a standard method to convert distance-based rewards to be positive, which makes it easier to handle episodes that terminate early (Peng and van de Panne, 2017; Peng et al., 2018, 2019). Early in training, the distance metric often produces large, noisy values. The RL method regularly tracks reward scaling statistics; the initial high variance data reduces the significance of better distance metric values produced later on by scaling them to small numbers. The improvement of using this normalized reward is shown in Fig. 15(a). In Fig. 15(b) we compare to a few baseline methods. The manual version uses a carefully engineered reward function from Peng et al. (2017).
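To make the idea of a positive, bounded reward concrete, the sketch below maps a distance to a reward with an exponential kernel of the kind often used for tracking rewards in physics-based character control. The kernel form and the parameter name `width` are assumptions for illustration and are not taken from the paper.

```python
import math


def distance_to_reward(d, width=1.0):
    """Map a non-negative distance to a reward in (0, 1].

    Assumed form: exp(-width * d^2). `width` is a hypothetical name for
    the scale controlling how quickly the reward falls off with distance.
    """
    return math.exp(-width * d * d)
```

With a mapping like this, every step yields a positive reward, so an episode that terminates early simply accumulates less return rather than avoiding negative rewards.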
7.7 Positive and Negative Examples
We use two methods to generate positive and negative examples. The first method is similar to TCN, where we can assume that sequences that overlap more in time are more similar. We generate two sequences for each episode, one for the agent and one for the imitation motion. Here we list the methods used to alter sequences for positive pairs.
1. Adding Gaussian noise to each state in the sequence (mean and variance ).
2. Out-of-sync versions, where the first state from the first sequence and the last ones from the second sequence are removed.
3. Duplicating the first state in either sequence.
4. Duplicating the last state in either sequence.
We alter sequences for negative pairs by:
1. Reversing the ordering of the second sequence in the pair.
2. Randomly picking a state out of the second sequence and replicating it to be as long as the first sequence.
3. Randomly shuffling one sequence.
4. Randomly shuffling both sequences.
5. Using one sequence from the expert and one from the agent. We call these adversarial sequence pairs.
6. In the examples that include additional motion classes, the negatives are selected from the other classes.
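Several of the sequence alterations listed above can be sketched as small helper functions. This is a minimal illustration that treats a sequence as a list of state vectors; the helper names and the default noise scale are hypothetical, and the out-of-sync helper removes a single state from each sequence.

```python
import random


def add_noise(seq, std=0.05, rng=random):
    """Positive: add zero-mean Gaussian noise to every state (std is assumed)."""
    return [[x + rng.gauss(0.0, std) for x in state] for state in seq]


def out_of_sync(seq_a, seq_b):
    """Positive: drop the first state of seq_a and the last state of seq_b."""
    return seq_a[1:], seq_b[:-1]


def duplicate_first(seq):
    """Positive: repeat the first state once."""
    return [seq[0]] + list(seq)


def duplicate_last(seq):
    """Positive: repeat the last state once."""
    return list(seq) + [seq[-1]]


def reversed_copy(seq):
    """Negative: reverse the temporal order."""
    return list(reversed(seq))


def freeze_random_state(seq, length, rng=random):
    """Negative: repeat one randomly chosen state `length` times."""
    return [rng.choice(seq)] * length


def shuffled(seq, rng=random):
    """Negative: randomly permute the states."""
    out = list(seq)
    rng.shuffle(out)
    return out
```

A positive pair could then be formed as `(seq, add_noise(seq))` and a negative pair as `(seq, reversed_copy(seq))`, with the label passed to the Siamese loss.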
The second method we use to create positive and negative examples is by including data for additional classes of motion. These classes denote different task types. For the humanoid3d environment, we generate data for walking-dynamic-speed, running, backflipping and front-flipping. Pairs from the same tasks are labelled as positive, and pairs from different classes are negative.
Viewpoint Invariance
Because we perform many types of data augmentation (clipping, warping, cropping, etc.), the video data can be collected from viewpoints with different angles and distances, as long as most of the agent is captured by the video. The data augmentations are designed to help increase the diversity of data, but they also result in VIRL being able to compute distances from noisy data from different view locations.
7.8 Hyper Parameter Analysis
To determine the best values for in Equation 4 we perform grid search over possible values in increments where . This evaluation is performed over all tasks in Section 5 using 3 seeds on each task, and the parameters that result in the best learning performance are selected. The final parameters are designed to be robust to new environments and may not require additional tuning.
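The selection procedure can be pictured as an exhaustive search over a small grid. The sketch below is generic: the weight names, the candidate values and the `evaluate` callback are all hypothetical placeholders, with `evaluate` standing in for a routine that would average learning performance over tasks and seeds.

```python
import itertools


def grid_search(evaluate, names, values):
    """Try every combination of `values` for the parameters in `names`.

    evaluate: callable taking a dict {name: value} and returning a score
              (higher is better).
    Returns (best_setting, best_score).
    """
    best, best_score = None, float("-inf")
    for combo in itertools.product(values, repeat=len(names)):
        setting = dict(zip(names, combo))
        score = evaluate(setting)
        if score > best_score:
            best, best_score = setting, score
    return best, best_score
```

The cost grows as the number of candidate values raised to the number of weights, which is why a coarse grid is a natural choice when each evaluation is a full training run.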
References
- [1] Abbeel and Ng (2004). Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML ’04, pages 1–, New York, NY, USA, 2004. ACM. ISBN 1-58113-838-5. doi: 10.1145/1015330.1015430. URL http://doi.acm.org/10.1145/1015330.1015430.
- [2] Argall et al. (2009). Brenna D. Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009. ISSN 0921-8890. doi: 10.1016/j.robot.2008.10.024. URL http://www.sciencedirect.com/science/article/pii/S0921889008001772.
- [3] Blakemore and Decety (2001). Sarah-Jayne Blakemore and Jean Decety. From the perception of action to the understanding of intention. Nature Reviews Neuroscience, 2(8):561–567, 2001.
- [4] Brown et al. (2019). Daniel Brown, Wonjoon Goo, Prabhat Nagarajan, and Scott Niekum. Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. In International Conference on Machine Learning, 2019.
- [5] Brown et al. (2020). Daniel S. Brown, Wonjoon Goo, and Scott Niekum. Better-than-demonstrator imitation learning via automatically-ranked demonstrations. In Conference on Robot Learning, pages 330–359, 2020.
- [6] Chebotar et al. (2017). Yevgen Chebotar, Karol Hausman, Marvin Zhang, Gaurav Sukhatme, Stefan Schaal, and Sergey Levine. Combining model-based and model-free updates for trajectory-centric reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 703–711, 2017.
- [7] Chopra et al. (2005). Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), volume 1, pages 539–546. IEEE, 2005.
- [8] Chung et al. (2015). Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C. Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. Advances in Neural Information Processing Systems, 28, 2015.
