Active Learning of Probabilistic Movement Primitives

Adam Conkey; Tucker Hermans

arXiv:1907.00277·cs.RO·May 5, 2022

TL;DR

This paper introduces an active learning method for Probabilistic Movement Primitives (ProMPs) that efficiently selects demonstrations to improve task generalization, demonstrated through grasping experiments on a robot.

Contribution

It proposes a novel active learning approach with a new uncertainty sampling metric, Greatest Mahalanobis Distance, for better generalization with fewer demonstrations.

Findings

1. The new sampling method outperforms random sampling in generalization.

2. Fewer demonstrations are needed to achieve effective task learning.

3. The approach is validated on real robot grasping tasks.

Equations

$p(\bm{\tau}\mid\bm{w},\bm{\Sigma}_{\bm{y}})=\prod_{t}p(\bm{y}_{t}\mid\bm{\Psi}_{t}\bm{w},\bm{\Sigma}_{\bm{y}})$

$p(\bm{\tau}\mid\bm{\theta},\bm{\Sigma}_{\bm{y}})$

$\bm{\mu}_{\bm{w}}^{*}$

$\bm{\Sigma}_{\bm{w}}^{*}$

$p(\bm{\tau}\mid\mathcal{M})=\sum_{j=1}^{J}\pi_{j}\,\mathcal{N}(\bm{\tau}\mid\bm{\Psi}\bm{\mu}_{\bm{w}}^{j},\bm{\Psi}^{T}\bm{\Sigma}_{\bm{w}}^{j}\bm{\Psi}+\bm{\Sigma}_{\bm{y}})$

$d(\bm{w},\bm{\theta}_{j})=(\bm{w}-\bm{\mu}_{\bm{w}}^{j})^{T}(\bm{\Sigma}_{\bm{w}}^{j})^{-1}(\bm{w}-\bm{\mu}_{\bm{w}}^{j})$

$\delta_{j}=\max\left\{d(\bm{w}_{i},\bm{\theta}_{j}):\frac{d(\bm{w}_{i},\bm{\theta}_{j})-M_{j}}{MAD_{j}}<\beta\right\}$

$\bm{\mu}_{\bm{w}}^{j}$

$\bm{\Sigma}_{\bm{w}}^{j}$

$\bm{\eta}^{*}=\operatorname*{argmax}_{\bm{\eta}\in\mathcal{C}_{d}}U(\bm{\eta})$

$U_{lc}(\bm{\eta})=\operatorname*{argmax}_{\bm{\eta}\in\mathcal{C}_{d}}\left[1-p(z_{1}\mid\bm{\eta})\right]$

$U_{mm}(\bm{\eta})=\operatorname*{argmax}_{\bm{\eta}\in\mathcal{C}_{d}}\left[p(z_{2}\mid\bm{\eta})-p(z_{1}\mid\bm{\eta})\right]$

$U_{me}(\bm{\eta})=\operatorname*{argmax}_{\bm{\eta}\in\mathcal{C}_{d}}-\sum_{z=1}^{J}p(z\mid\bm{\eta})\log p(z\mid\bm{\eta})$

$U_{gm}(\bm{\eta})=\operatorname*{argmax}_{\bm{\eta}\in\mathcal{C}_{d}}\min_{j}d(\bm{\eta},\bm{\theta}_{\bm{\eta}}^{j})$

$g({}^{0}T_{obj})={}^{0}T_{obj}\cdot{}^{obj}T_{ee}={}^{0}T_{ee}$

$p(\bm{y}_{t})=\sum_{r=1}^{R}\beta_{r}\,\mathcal{N}(\bm{y}_{t}\mid\bm{\mu}_{\bm{y}_{t}}^{r},\bm{\Sigma}_{\bm{y}_{t}}^{r})$

$p(\bm{\eta}\mid z=j)=\sum_{r=1}^{R}\beta_{r}\,\mathcal{N}(\bm{\tilde{y}}_{t}\mid\bm{\Psi}_{t}\bm{\tilde{\mu}}_{\bm{w}}^{j},\bm{\Psi}_{t}^{T}\bm{\tilde{\Sigma}}_{\bm{w}}^{j}\bm{\Psi}_{t}+\bm{\Sigma}_{\bm{y}})$

$p(z\mid\bm{\eta})$

Full text

Adam Conkey¹ and Tucker Hermans¹,²

¹School of Computing; Robotics Center; University of Utah, USA. ²NVIDIA, USA. Email: [email protected], [email protected]

Abstract

A Probabilistic Movement Primitive (ProMP) defines a distribution over trajectories with an associated feedback policy. ProMPs are typically initialized from human demonstrations and achieve task generalization through probabilistic operations. However, there is currently no principled guidance in the literature to determine how many demonstrations a teacher should provide and what constitutes a “good” demonstration for promoting generalization. In this paper, we present an active learning approach to learning a library of ProMPs capable of task generalization over a given space. We utilize uncertainty sampling techniques to generate a task instance for which a teacher should provide a demonstration. The provided demonstration is incorporated into an existing ProMP if possible, or a new ProMP is created from the demonstration if it is determined that it is too dissimilar from existing demonstrations. We provide a qualitative comparison between common active learning metrics; motivated by this comparison we present a novel uncertainty sampling approach named “Greatest Mahalanobis Distance.” We perform grasping experiments on a real KUKA robot and show our novel active learning measure achieves better task generalization with fewer demonstrations than a random sampling over the space.

I Introduction

Learning from demonstration [1, 2, 3] offers a promising approach for robot users untrained in programming to command robots to perform common manipulation tasks. By teaching the robot through demonstration, the user can provide manipulation expertise without needing to be an expert in robotics. Probabilistic Movement Primitives (ProMPs) provide a useful policy representation for generating adaptable robot motion learned from demonstration [4]. A ProMP encodes a distribution over trajectories and is typically initialized with several demonstrations from a human teacher. Task generalization to new goals and contexts is primarily achieved by conditioning the trajectory distribution on desired trajectory waypoints. This generalization mechanism has been successfully applied in a variety of applications including grasping objects while avoiding obstacles [5], relocating objects of unknown weight [6], collaborative assembly tasks [7], and robot table tennis [8].

However, ProMPs require an indeterminate number of demonstrations to confidently generalize over the desired task space and appropriately estimate the associated task covariance [9]. Many real-world tasks require hundreds of demonstrations to fully estimate the demonstration covariance [10]. If too few demonstrations are provided, numerical issues arise in the form of singular covariance matrices, and it is common to use a non-informative prior for the covariance in order to sidestep this issue [4, 11, 12]. However, the generalization capability of a ProMP can be compromised when using parameters that do not adequately estimate the true distribution associated with a task [4, 12]. The current alternatives available to the teacher are to either expend undue effort to exhaustively demonstrate a task to the robot, or to attempt to capture, in only a small number of demonstrations, the task variation necessary to achieve the desired generalization. There is a need for a third option that guides the teacher to provide only those demonstrations necessary to ensure the desired task generalization is achievable.

In this paper, we present a novel active learning procedure for learning a library of ProMPs from demonstration that is capable of task generalization over a desired region. We frame this as an active learning problem by conceiving of each ProMP in the library as its own class, with the guiding intuition that we want to fully “classify” the space, i.e. achieve full ProMP coverage of the space. We adopt an uncertainty sampling approach [13] that enables the robot to generate the task sample for which it is least likely to “predict” correctly, i.e. generalize to with a ProMP. By allowing the learner to generate task samples to be “labeled”, i.e. demonstrated by the teacher, we relieve the teacher of the burden of deciding which demonstration to provide next. Additionally, by informing task selection with uncertainty measures, we reduce the total number of demonstrations needed to learn a task compared with providing demonstrations in an ad hoc manner.

We provide a qualitative comparison of different uncertainty sampling measures commonly used for active learning in supervised learning settings: Least Confident, Maximum Entropy, and Minimum Margin. We show that these measures are not suitable for promoting task generalization with ProMPs. We propose a new measure we call Greatest Mahalanobis Distance that effectively generates task instances that are not in close proximity to any existing ProMP distribution. We demonstrate with grasping experiments on a real KUKA robot that our method achieves better task generalization with fewer demonstrations than random sampling over the space.

To briefly summarize our contributions, in this paper we:

1. Formalize an active learning approach for learning manipulation tasks from demonstration using Probabilistic Movement Primitives.

2. Provide qualitative comparisons of the three most common uncertainty sampling techniques: Least Confident, Minimum Margin, and Maximum Entropy.

3. Present a novel uncertainty sampling function suitable for building a mixture of ProMPs capable of generalizing a manipulation task within a given region.

4. Leverage the probabilistic information encoded in ProMP policies to automatically determine which ProMP in the mixture a new demonstration should be incorporated into, and which ProMP to execute for a new task instance.

We structure the remainder of the paper as follows. We first review related work in active learning from demonstration in Section II. We then provide a brief technical overview of ProMPs in Section III-A and define our novel approach to learning a library of ProMPs through active learning in Sections III-B–III-D. We present an overview of our experimental setup in Section IV and describe the corresponding results in Section V for grasping experiments performed on a physical KUKA LBR4+ robot. We conclude in Section VI with some final remarks and directions for future work.

II Related Work

Active learning, where a learner actively poses queries to a teacher for input to reduce sample complexity, has been widely applied in supervised learning settings [13]. Our approach is most suitably situated in the literature on active learning from demonstration [9, 14, 15, 16, 17], also referred to as active imitation learning [18, 19], in which the learner generates task instances for which the teacher may provide a demonstration. Active learning from demonstration has been applied to autonomous navigation [14], object seeking with a quadruped robot [15], grasping objects [16], reaching to task space positions with a manipulator [9], and generating smooth robot motion from a latent-space encoding [17]. Also included in this area are approaches where the learner does not request full task demonstrations, but instead asks for the action to take in the particular state that it is in [18, 19, 20]. These approaches, however, are only applicable to finite action spaces where actions are easily specified by a teacher.

The approaches most closely related to ours are those using active learning to learn Dynamic Movement Primitives (DMPs) for grasping [16] and reaching tasks [9]. In [16], a hybrid system is presented such that a high-level active learner generates grasp configurations based on a variant of Upper Confidence Bound (UCB) policies [21], and a low-level reactive DMP controller executes the grasp motion based on a task demonstration. In [9], the robot incrementally learns DMPs for reaching pre-defined positions in its workspace. A Gaussian process is used for sampling trajectories with an associated variance. If a function of the variance is below an uncertainty threshold, then an existing DMP is used with the goal appropriately adapted. Otherwise, the human user is asked for a new demonstration to reach the new goal position. By utilizing ProMPs in our method instead of DMPs, we are able to achieve greater generalization capabilities [4] while leveraging the probabilistic information already encoded in the policy representation to compute confidence measures. Additionally, we are able to provide a probabilistic measure of the robot’s ability to generalize a task in a given region, as opposed to [9] which can only say for a given instance whether or not the robot is confident it can execute the motion. Our approach therefore has the advantage that the robot, after learning a task, can be deployed with an associated uncertainty estimate that it will succeed on any task instance it is given to perform.

Also relevant to our approach is work in the area of active learning for parameterized skills [22]. In [22] the agent selects tasks to practice in a reinforcement learning setting with the objective of optimizing for expected improvement in skill performance. Task competency is measured over a recursively split goal space in [23] for an intrinsically-motivated agent. Active Contextual Policy Search [24] considers a learner that generates task contexts to condition a high-level policy on, such that a lower-level policy can be optimized to maximize an intrinsic reward function. These works are each applied in reinforcement learning settings and are agnostic to any particular policy representation. Our approach, on the other hand, makes use of human demonstrations and, by committing to a particular policy representation (ProMPs), we are able to compute task competency in a unified manner utilizing information from the policy representation itself.

III Methods

We first provide a brief background on ProMPs in Section III-A to introduce the concepts relevant to our contributions. We describe our approach to learning a mixture of ProMPs in Section III-B. In Section III-C we present our novel approach for active learning of ProMPs and discuss the methods we compare. Finally, we provide details in Section III-D on how to use our active learning method for the concrete task we consider in our experiments: reaching to grasp an object.

III-A Background

We utilize a formulation of ProMPs that closely parallels that of [4]. The ProMP trajectory distribution has the general form

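$$p(\bm{\tau}\mid\bm{w},\bm{\Sigma}_{\bm{y}})=\prod_{t}p(\bm{y}_{t}\mid\bm{\Psi}_{t}\bm{w},\bm{\Sigma}_{\bm{y}})\tag{1}$$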

where $\bm{\tau}=[\bm{y}_{0},\dots,\bm{y}_{T}]$ is a trajectory of the state $\bm{y}_{t}\in\mathcal{S}$ for state space $\mathcal{S}\subseteq\mathbb{R}^{d}$ , $\bm{\Psi}_{t}\in\mathbb{R}^{d\times dn}$ is a block-diagonal matrix of $n$ basis functions for each dimension of the state, $\bm{w}\in\mathbb{R}^{dn}$ is a weight vector, and $\bm{\Sigma}_{\bm{y}}$ is the observation noise. We assume, as in [4], that the time-dependent distributions are Gaussian, i.e. $p(\bm{y}_{t}\mid\bm{\Psi}_{t}\bm{w},\bm{\Sigma}_{\bm{y}})=\mathcal{N}(\bm{y}_{t}\mid\bm{\Psi}_{t}\bm{w},\bm{\Sigma}_{\bm{y}})$ . This results in $p(\bm{\tau}\mid\bm{w},\bm{\Sigma}_{\bm{y}})$ being Gaussian as well since it is a product of Gaussian distributions.
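To make the shapes concrete, the following minimal sketch builds a block-diagonal $\bm{\Psi}_{t}\in\mathbb{R}^{d\times dn}$ and evaluates the mean state $\bm{\Psi}_{t}\bm{w}$. The normalized Gaussian basis functions, the phase variable in $[0,1]$, the `width` value, and the name `basis_matrix` are assumptions made for illustration; the text does not specify the basis type.

```python
import numpy as np

def basis_matrix(phase, num_basis, num_dims, width=0.05):
    """Block-diagonal basis matrix of shape (num_dims, num_dims * num_basis).

    Assumes normalized Gaussian basis functions with evenly spaced centers
    on [0, 1]; this choice is illustrative, not taken from the paper.
    """
    centers = np.linspace(0.0, 1.0, num_basis)
    phi = np.exp(-0.5 * (phase - centers) ** 2 / width)
    phi /= phi.sum()  # normalize so the activations sum to one
    # One copy of the row vector phi per state dimension, on the diagonal.
    return np.kron(np.eye(num_dims), phi[None, :])

Psi_t = basis_matrix(phase=0.5, num_basis=4, num_dims=2)
w = np.arange(8.0)   # stacked weights: 4 per state dimension
y_t = Psi_t @ w      # mean state at this phase, shape (2,)
```

Each row of `Psi_t` only touches the weights of its own state dimension, which is what the block-diagonal structure expresses.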

We parameterize the distribution with $\bm{\theta}=\{\bm{\mu}_{\bm{w}},\bm{\Sigma}_{\bm{w}}\}$ and marginalize out the weights such that

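$$p(\bm{\tau}\mid\bm{\theta},\bm{\Sigma}_{\bm{y}})=\int p(\bm{\tau}\mid\bm{w},\bm{\Sigma}_{\bm{y}})\,p(\bm{w}\mid\bm{\theta})\,d\bm{w}\tag{2}$$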

Task generalization is achieved by conditioning $p(\bm{w}\mid\bm{\theta})$ on a desired trajectory waypoint $\bm{y}_{t}^{*}$ with covariance $\bm{\Sigma}_{\bm{y}_{t}}^{*}$ . The updated parameters $\bm{\theta}^{*}=\{\bm{\mu}_{\bm{w}}^{*},\bm{\Sigma}_{\bm{w}}^{*}\}$ are computed by

[Equations (3) and (4): closed-form updates for $\bm{\mu}_{\bm{w}}^{*}$ and $\bm{\Sigma}_{\bm{w}}^{*}$]

This closed-form update is possible because we assume, as in [4], that $p(\bm{w}\mid\bm{\theta})$ is Gaussian.
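As an illustration of conditioning a Gaussian weight distribution on a waypoint, the sketch below applies the standard Gaussian conditioning identities, which is the usual form of this update in the ProMP literature [4]. It is a sketch under assumptions rather than the authors' implementation: the name `condition_promp` is hypothetical, and the transposes are written for the convention $\bm{\Psi}_{t}\in\mathbb{R}^{d\times dn}$.

```python
import numpy as np

def condition_promp(mu_w, Sigma_w, Psi_t, y_star, Sigma_y_star):
    """Condition N(mu_w, Sigma_w) on reaching y_star at one timestep.

    Psi_t has shape (d, d*n), so Psi_t @ w is the state at that timestep.
    """
    # Predicted state covariance plus the waypoint covariance.
    S = Sigma_y_star + Psi_t @ Sigma_w @ Psi_t.T
    # Gain that maps state-space residuals back to weight space.
    K = Sigma_w @ Psi_t.T @ np.linalg.inv(S)
    mu_new = mu_w + K @ (y_star - Psi_t @ mu_w)
    Sigma_new = Sigma_w - K @ Psi_t @ Sigma_w
    return mu_new, Sigma_new

# Toy example: 1-D state, 3 basis functions, a tightly specified waypoint.
mu_w = np.zeros(3)
Sigma_w = np.eye(3)
Psi_t = np.array([[0.2, 0.5, 0.3]])
y_star = np.array([1.0])
Sigma_y_star = np.array([[1e-6]])
mu_new, Sigma_new = condition_promp(mu_w, Sigma_w, Psi_t, y_star, Sigma_y_star)
```

With a small waypoint covariance, the conditioned mean passes almost exactly through the waypoint and the state variance at that timestep collapses toward the waypoint covariance.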

III-B Learning a Mixture of ProMPs from Demonstration

We employ a mixture of multiple ProMPs parameterized as $\mathcal{M}=\{(\bm{\theta}_{1},\pi_{1}),\dots,(\bm{\theta}_{J},\pi_{J})\}$ where $\bm{\theta}_{j}=\{\bm{\mu}_{\bm{w}}^{j},\bm{\Sigma}_{\bm{w}}^{j}\}$ , since it is known that a single ProMP is not sufficient to properly characterize a given space [7]. We formalize the mixture as

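$$p(\bm{\tau}\mid\mathcal{M})=\sum_{j=1}^{J}\pi_{j}\,\mathcal{N}(\bm{\tau}\mid\bm{\Psi}\bm{\mu}_{\bm{w}}^{j},\bm{\Psi}^{T}\bm{\Sigma}_{\bm{w}}^{j}\bm{\Psi}+\bm{\Sigma}_{\bm{y}})\tag{5}$$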

where $\pi_{j}\in[0,1]$ are mixture coefficients and $\bm{\mu}_{\bm{w}}^{j}$ , $\bm{\Sigma}_{\bm{w}}^{j}$ are the mean and covariance associated with the $j^{th}$ ProMP.

The mixture $\mathcal{M}$ is learned incrementally over time as new demonstrations are acquired. In order to incorporate a new demonstration, we first learn a weight vector $\bm{w}$ from the demonstration using Ridge regression as in [4]. For the first demonstration received, a new ProMP is created with mean $\bm{\mu}_{\bm{w}}^{1}=\bm{w}$ and covariance $\bm{\Sigma}_{\bm{w}}^{1}=\gamma\mathbf{I}$ , where $\mathbf{I}$ is the identity matrix and $\gamma\in\mathbb{R}^{+}$ is a scaling parameter. This serves as a non-informative prior for the covariance [4]. For subsequent demonstrations, we must determine which ProMP in the mixture the new demonstration should be incorporated into. In contrast to previous work [25] that learns a separate model for a gating function to the mixture components, we directly utilize the probabilistic information encoded in the learned ProMPs to determine which ProMP a new demonstration should be incorporated into. We use the Mahalanobis distance [26] as a measure of disparity between $\bm{w}$ and each ProMP distribution $\bm{\theta}_{j}$ given by

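$$d(\bm{w},\bm{\theta}_{j})=(\bm{w}-\bm{\mu}_{\bm{w}}^{j})^{T}(\bm{\Sigma}_{\bm{w}}^{j})^{-1}(\bm{w}-\bm{\mu}_{\bm{w}}^{j})\tag{6}$$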

A demonstration is incorporated into the $j^{th}$ ProMP if the Mahalanobis distance between the learned weight vector and the ProMP distribution falls below a disparity threshold $\delta\in\mathbb{R}^{+}$ , i.e. if $d(\bm{w},\bm{\theta}_{j})\leq\delta$ . Instead of choosing a fixed threshold, we compute a robust measure of a disparity threshold for each ProMP utilizing the ProMP generative model. For each ProMP, we create a set of weight vector samples $\mathcal{W}_{j}$ and compute the value of Equation (6) for each $\bm{w}_{i}\in\mathcal{W}_{j}$ . We use Median Absolute Deviation outlier filtering [27] to compute the threshold

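$$\delta_{j}=\max\left\{d(\bm{w}_{i},\bm{\theta}_{j}):\frac{d(\bm{w}_{i},\bm{\theta}_{j})-M_{j}}{MAD_{j}}<\beta\right\}\tag{7}$$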

where $M_{j}$ is the median Mahalanobis distance of the sample weights $\bm{w}_{i}\in\mathcal{W}_{j}$ to the ProMP distribution $\bm{\theta}_{j}$ and $MAD_{j}$ is the Median Absolute Deviation computed by $MAD_{j}=\text{med}\left(|d(\bm{w}_{i},\bm{\theta}_{j})-M_{j}|\right)$ . The parameter $\beta$ governs how many outliers are discarded and is easily tuned; standard values range from approximately 3 (few outliers discarded) to 2 (many outliers discarded) [27].
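The threshold computation described above can be sketched as follows. The sample count, the default `beta=2.5`, the fixed seed, and the function names are assumptions for the example; the distance is the quadratic form exactly as written in Equation (6), without a square root.

```python
import numpy as np

def mahalanobis_sq(w, mu, Sigma):
    """Quadratic form (w - mu)^T Sigma^{-1} (w - mu)."""
    diff = w - mu
    return float(diff @ np.linalg.solve(Sigma, diff))

def disparity_threshold(mu, Sigma, beta=2.5, num_samples=1000, seed=0):
    """Largest sampled distance that survives MAD outlier filtering."""
    rng = np.random.default_rng(seed)
    # Draw weight vectors from the ProMP's own generative model.
    samples = rng.multivariate_normal(mu, Sigma, size=num_samples)
    d = np.array([mahalanobis_sq(w, mu, Sigma) for w in samples])
    median = np.median(d)
    mad = np.median(np.abs(d - median))
    inliers = d[(d - median) / mad < beta]
    return float(inliers.max())

mu = np.zeros(4)
Sigma = np.eye(4)
delta = disparity_threshold(mu, Sigma)
far_weight = 100.0 * np.ones(4)
is_member = mahalanobis_sq(far_weight, mu, Sigma) <= delta  # False for this far-away vector
```

Because the threshold is derived from samples of each ProMP, it adapts to the spread of that ProMP instead of relying on a single hand-tuned constant.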

Once it is determined that $d(\bm{w},\bm{\theta}_{j})\leq\delta_{j}$ , the new demonstration is incorporated into the $j^{th}$ ProMP by updating the ProMP’s distribution parameters as

[Equations (8) and (9): updates for $\bm{\mu}_{\bm{w}}^{j}$ and $\bm{\Sigma}_{\bm{w}}^{j}$]

The mean $\bm{\mu}_{\bm{w}}^{j}$ is computed as the Maximum Likelihood Estimate (MLE) where $N$ is the number of samples the ProMP is learned from, including the newly acquired sample. $\bm{\Sigma}_{\bm{w}}^{j}$ is updated as the Maximum A Posteriori (MAP) estimate under an Inverse Wishart Prior, which amounts to a convex combination of a positive semi-definite prior $\bm{\Sigma}_{0}$ and the MLE of the sample covariance [28]. We adopt the method of [11] and set $\bm{\Sigma}_{0}$ to be the estimate of $\bm{\Sigma}_{\bm{w}}^{j}$ from the previous learning iteration. This ensures that $\bm{\Sigma}_{\bm{w}}^{j}$ is always full rank (due to the initial diagonal prior) and that the parameter estimate is not unduly influenced by a new sample. We found this to be important in our experiments since, in general, the number of demonstrations each ProMP is learned from is considerably smaller than the dimensionality of the weight space; using an ill-conditioned matrix in the probability computations can result in non-informative values.

If it happens that $d(\bm{w},\bm{\theta}_{j})>\delta_{j}$ for every ProMP in the mixture, then we create a new ProMP with an uninformative prior as described previously. When initializing all ProMPs we set the initial $\bm{\Sigma_{0}}=\sigma\bm{I}$ for a small value of $\sigma\in\mathbb{R}^{+}$ .
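The effect of blending a prior into the covariance estimate can be seen in a small sketch. The mixing weight `prior_weight` and the function name `update_promp` are assumptions; the exact coefficients of the convex combination are not reproduced here, so this only illustrates why the result stays full rank when there are far fewer samples than weight dimensions.

```python
import numpy as np

def update_promp(weight_samples, Sigma_prior, prior_weight=0.5):
    """Mean as the sample average; covariance as a convex combination.

    prior_weight is a hypothetical mixing coefficient in [0, 1].
    """
    W = np.asarray(weight_samples, dtype=float)   # shape (N, dn)
    mu = W.mean(axis=0)                           # MLE of the mean
    centered = W - mu
    S_mle = centered.T @ centered / W.shape[0]    # MLE covariance, rank at most N - 1
    Sigma = prior_weight * Sigma_prior + (1.0 - prior_weight) * S_mle
    return mu, Sigma

# Two demonstrations in a 5-dimensional weight space.
samples = [[1.0, 0.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 0.0, 0.0, 0.0]]
Sigma_prior = 1e-2 * np.eye(5)   # small diagonal prior
mu, Sigma = update_promp(samples, Sigma_prior)
rank = np.linalg.matrix_rank(Sigma)
```

The sample covariance alone has rank one here, while the blended estimate is invertible, so the Mahalanobis distance remains well defined.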

III-C Active Learning of ProMPs

The active learner’s objective is to learn a mixture of ProMPs $\mathcal{M}$ that achieves task generalization over some region of its environment. We formalize this region by defining a continuous context space $\mathcal{C}$ that specifies the task to be performed in terms of context variables [29] (e.g. the pose of an object to be grasped). We assume there is a subset $\mathcal{C}_{d}\subseteq\mathcal{C}$ over which task generalization is desired.

We estimate the achievable feasible region by the coverage achieved by the mixture of ProMPs at the timestep relevant for the task context $\bm{\eta}$ . Because the context variable is not, in general, a direct subset of the ProMP state, we allow for a mapping $g:\mathcal{C}\rightarrow\mathcal{S}$ between the context space $\mathcal{C}$ and the ProMP state space $\mathcal{S}$ . We define one such mapping below in Section III-D suitable for our experimental grasping task.

We formalize our active learning problem by conceiving of each ProMP in the mixture as its own class. We then employ active learning through uncertainty sampling [13], in which the learner generates a new task instance for which the teacher can provide a demonstration governed by

$$\bm{\eta}^{*}=\operatorname*{argmax}_{\bm{\eta}\in\mathcal{C}_{d}}U(\bm{\eta})\tag{10}$$

where $\bm{\eta}\in\mathcal{C}_{d}$ is a context variable sufficient to describe the task and $U(\bm{\eta})$ is an uncertainty sampling function that measures the uncertainty the learner has about characterizing a given task instance as being a member of one of the available classes. We qualitatively compare the three most common uncertainty sampling measures [13]:

Least Confident:

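$$U_{lc}(\bm{\eta})=\operatorname*{argmax}_{\bm{\eta}\in\mathcal{C}_{d}}\left[1-p(z_{1}\mid\bm{\eta})\right]\tag{11}$$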

Minimum Margin:

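$$U_{mm}(\bm{\eta})=\operatorname*{argmax}_{\bm{\eta}\in\mathcal{C}_{d}}\left[p(z_{2}\mid\bm{\eta})-p(z_{1}\mid\bm{\eta})\right]\tag{12}$$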

Maximum Entropy:

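$$U_{me}(\bm{\eta})=\operatorname*{argmax}_{\bm{\eta}\in\mathcal{C}_{d}}-\sum_{z=1}^{J}p(z\mid\bm{\eta})\log p(z\mid\bm{\eta})\tag{13}$$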

In Equations 11–13, $p(z\mid\bm{\eta})$ indicates the probability of a class label $z$ being attributed to instance $\bm{\eta}$ , where the class label corresponds to any one of the $J$ ProMPs. In Equations 11 and 12, $z_{1}=\operatorname*{argmax}_{z}p(z\mid\bm{\eta})$ is the most likely label for instance $\bm{\eta}$ while $z_{2}$ in Eq. 12 is the second most likely label. Intuitively, the Least Confident measure (Eq. 11) selects the task instance $\bm{\eta}^{*}$ whose highest probability over all labels $z\in\mathcal{Z}$ is lowest compared to all other instances $\bm{\eta}\in\mathcal{C}_{d}$ . The Minimum Margin measure (Eq. 12) chooses the instance with the greatest ambiguity between its two most likely classifications. The Maximum Entropy measure (Eq. 13) identifies the instance with the highest label uncertainty over all classes.
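The three measures can be compared on a small table of class probabilities. In this sketch each function returns a per-instance score and the selected instance is the argmax of that score; the probability values and function names are made up for illustration, and at least two classes are assumed for the margin.

```python
import numpy as np

def least_confident(P):
    """1 - p(z1 | eta): high when even the best label is unlikely."""
    return 1.0 - P.max(axis=1)

def minimum_margin(P):
    """p(z2 | eta) - p(z1 | eta): closest to zero when the top two labels tie."""
    top2 = np.sort(P, axis=1)[:, -2:]   # columns: second-best, best
    return top2[:, 0] - top2[:, 1]

def maximum_entropy(P):
    """Entropy of the label distribution for each instance."""
    safe = np.clip(P, 1e-12, 1.0)       # avoid log(0)
    return -(P * np.log(safe)).sum(axis=1)

# Rows are candidate task instances, columns are ProMPs (classes).
P = np.array([[0.98, 0.01, 0.01],
              [0.50, 0.49, 0.01],
              [0.40, 0.30, 0.30]])

pick_lc = int(np.argmax(least_confident(P)))
pick_mm = int(np.argmax(minimum_margin(P)))
pick_me = int(np.argmax(maximum_entropy(P)))
```

On this table the Least Confident and Maximum Entropy scores select the third instance, while Minimum Margin selects the second, whose two most likely labels are nearly tied.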

We define an additional, novel uncertainty sampling function:

Greatest Mahalanobis Distance

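$$U_{gm}(\bm{\eta})=\operatorname*{argmax}_{\bm{\eta}\in\mathcal{C}_{d}}\min_{j}d(\bm{\eta},\bm{\theta}_{\bm{\eta}}^{j})\tag{14}$$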

where $d(\cdot)$ is the Mahalanobis distance defined in Equation 6, $j$ indexes over ProMPs, and $\bm{\theta}_{\bm{\eta}}^{j}=\{\bm{\mu}_{\bm{\eta}}^{j},\bm{\Sigma}_{\bm{\eta}}^{j}\}$ defines a distribution over the context variable achieved by mapping the $j^{th}$ ProMP distribution parameters to the context space. We provide details for the specific mapping we utilize in this paper below in Section III-D. Our Greatest Mahalanobis Distance approach is similar to Least Confident, but instead of choosing the instance with the lowest probability over classes (ProMPs), it selects the instance whose closest ProMP distribution is the farthest away.

We found that in practice, the Mahalanobis distance is less susceptible to computational issues than the probability values computed for the other uncertainty sampling functions. The density function for a Gaussian distribution requires dividing by the determinant of the covariance matrix, which is equivalent to dividing by the product of the eigenvalues of the covariance matrix. This value can be extremely small when the covariance is estimated from a small sample set, causing the computation to become unstable. We show in our experiments in Section V that the Greatest Mahalanobis Distance objective encourages the learner to select instances far away from instances it has already received demonstrations for, while the other uncertainty sampling functions tend to “compete” along the boundaries of the regions covered by adjacent ProMPs.
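A minimal sketch of the Greatest Mahalanobis Distance selection over a finite candidate set follows. The two-dimensional context, the component parameters, and the name `greatest_mahalanobis` are assumptions for the example; the components stand in for ProMP distributions already mapped into context space.

```python
import numpy as np

def greatest_mahalanobis(candidates, components):
    """Pick the candidate whose nearest component is farthest away.

    candidates: array of shape (num_candidates, k) in context space.
    components: list of (mu, Sigma) pairs, one per ProMP, in context space.
    """
    scores = []
    for eta in candidates:
        dists = []
        for mu, Sigma in components:
            diff = eta - mu
            dists.append(float(diff @ np.linalg.solve(Sigma, diff)))
        scores.append(min(dists))       # distance to the closest ProMP
    scores = np.array(scores)
    return int(np.argmax(scores)), scores

components = [(np.array([0.0, 0.0]), 0.01 * np.eye(2)),
              (np.array([1.0, 0.0]), 0.01 * np.eye(2))]
candidates = np.array([[0.5, 0.0],    # on the boundary between the two ProMPs
                       [0.1, 0.0],    # close to the first ProMP
                       [0.5, 2.0]])   # far from both
best, scores = greatest_mahalanobis(candidates, components)
```

The boundary point scores higher than the point near the first ProMP, but the unexplored point far from both components scores highest, so it is the one selected.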

Given the new task instance $\bm{\eta}^{*}$ generated by the uncertainty sampling optimization, the teacher provides a demonstration. The demonstration is then incorporated into the mixture of ProMPs as described in Section III-B. The procedure iterates until a stopping criterion is met, e.g. the task success rate over a validation set reaches an acceptable percentage.
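The query, demonstrate, incorporate cycle can be sketched as a toy loop. This is a heavily simplified illustration and not the authors' pipeline: each component is a Gaussian fit directly in a 2-D context space, the "teacher" simply returns the queried point, the threshold `delta` is fixed instead of MAD-based, and a fixed query budget replaces the stopping criterion. All names are hypothetical.

```python
import numpy as np

def sq_mahalanobis(x, mu, Sigma):
    diff = x - mu
    return float(diff @ np.linalg.solve(Sigma, diff))

def fit_gaussian(points, prior_var):
    """Mean and diagonally regularized covariance of one component's points."""
    pts = np.asarray(points)
    mu = pts.mean(axis=0)
    centered = pts - mu
    Sigma = centered.T @ centered / len(pts) + prior_var * np.eye(pts.shape[1])
    return mu, Sigma

def run_active_learning(candidates, teacher, num_queries, delta, prior_var=0.01):
    groups = []              # demonstrated points, grouped by component
    history = []             # queried task instances, in order
    eta = candidates[0]      # arbitrary first query
    for _ in range(num_queries):
        demo = teacher(eta)  # the teacher "labels" the queried instance
        history.append(eta)
        models = [fit_gaussian(g, prior_var) for g in groups]
        dists = [sq_mahalanobis(demo, mu, S) for mu, S in models]
        if dists and min(dists) <= delta:
            groups[int(np.argmin(dists))].append(demo)  # merge into nearest component
        else:
            groups.append([demo])                       # too dissimilar: new component
        models = [fit_gaussian(g, prior_var) for g in groups]
        # Next query: candidate whose nearest component is farthest away.
        scores = [min(sq_mahalanobis(c, mu, S) for mu, S in models)
                  for c in candidates]
        eta = candidates[int(np.argmax(scores))]
    return groups, history

grid = np.linspace(0.0, 1.0, 5)
candidates = np.array([[x, y] for x in grid for y in grid])
groups, history = run_active_learning(
    candidates, teacher=lambda eta: eta.copy(), num_queries=5, delta=9.0)
```

In this toy run the queries spread to the corners of the grid and then to its center, which mirrors the intended behavior of driving demonstrations into regions not yet covered.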

III-D Example ProMP Context

To be concrete in our formulation, we present a context mapping for the task of grasping an object placed arbitrarily on a table surface. We use this mapping in our experiments presented in Section V. The ProMP state consists of the end-effector pose with respect to the robot’s base frame ${}^{0}T_{ee}$ (e.g. from forward kinematics of the joint state), while the context space is the pose of the object with respect to the base frame ${}^{0}T_{obj}$ (e.g. from an object tracker using an RGB-D camera [30]). Once a desired end-effector pose in the object frame ${}^{obj}T_{ee}$ is known, the mapping $g:\mathcal{C}\rightarrow\mathcal{S}$ from context space to state space, as described in Section III-C, is achieved by a simple coordinate frame transformation:

$$g({}^{0}T_{obj})={}^{0}T_{obj}\cdot{}^{obj}T_{ee}={}^{0}T_{ee}\tag{15}$$
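The coordinate frame transformation can be sketched with homogeneous transforms. The numeric poses and the helper names `planar_pose` and `context_to_state` are assumptions chosen for the example.

```python
import numpy as np

def planar_pose(x, y, yaw, z=0.0):
    """4x4 homogeneous transform for a pose with rotation about the z axis."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0],
                 [s, c, 0.0],
                 [0.0, 0.0, 1.0]]
    T[:3, 3] = [x, y, z]
    return T

def context_to_state(T_0_obj, T_obj_ee):
    """Map the object pose (context) to an end-effector pose in the base frame."""
    return T_0_obj @ T_obj_ee

# Object at (0.5, 0.2) m on the table, rotated 90 degrees about z.
T_0_obj = planar_pose(0.5, 0.2, np.pi / 2)
# Hypothetical grasp pose: 10 cm along the object's x axis, 15 cm above it.
T_obj_ee = planar_pose(0.1, 0.0, 0.0, z=0.15)
T_0_ee = context_to_state(T_0_obj, T_obj_ee)
ee_position = T_0_ee[:3, 3]   # approximately (0.5, 0.3, 0.15)
```

Because the object is rotated by 90 degrees, the 10 cm offset along the object's x axis appears along the base frame's y axis.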

The pose ${}^{obj}T_{ee}$ could be specified manually or from the output of a grasp planner; however, we instead employ a Gaussian Mixture Model (GMM) over successful end-effector poses in the object frame. The GMM is defined by

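$$p(\bm{y}_{t})=\sum_{r=1}^{R}\beta_{r}\,\mathcal{N}(\bm{y}_{t}\mid\bm{\mu}_{\bm{y}_{t}}^{r},\bm{\Sigma}_{\bm{y}_{t}}^{r})\tag{16}$$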

where $\beta_{r}\in[0,1]$ are the mixture coefficients and $\bm{\mu}_{\bm{y}_{t}}^{r},\bm{\Sigma}_{\bm{y}_{t}}^{r}$ are the mean and covariance of the end-effector pose in the object frame for the $r^{th}$ component. A visualization of the mean components learned from the demonstrations given in our experiments can be seen in Figure 2. Using the known pose of the object in the base frame, we transform each $\bm{\mu}_{\bm{y}_{t}}^{r}$ , $\bm{\Sigma}_{\bm{y}_{t}}^{r}$ to get $\bm{\tilde{\mu}}_{\bm{y}_{t}}^{r}$ , $\bm{\tilde{\Sigma}}_{\bm{y}_{t}}^{r}$ , which are the mean and covariance of the end-effector with respect to the base frame.

We leverage these parameters as the condition points for the ProMP, i.e. we set $\bm{y}_{t}^{*}=\bm{\tilde{\mu}}_{\bm{y}_{t}}^{r}$ and $\bm{\Sigma}_{\bm{y}_{t}}^{*}=\bm{\tilde{\Sigma}}_{\bm{y}_{t}}^{r}$ in Equations 3 and 4. We then compute the probability of a particular task being achievable by the ProMP mixture as

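$$p(\bm{\eta}\mid z=j)=\sum_{r=1}^{R}\beta_{r}\,\mathcal{N}(\bm{\tilde{y}}_{t}\mid\bm{\Psi}_{t}\bm{\tilde{\mu}}_{\bm{w}}^{j},\bm{\Psi}_{t}^{T}\bm{\tilde{\Sigma}}_{\bm{w}}^{j}\bm{\Psi}_{t}+\bm{\Sigma}_{\bm{y}})\tag{17}$$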

where $z=j$ indicates the $j^{th}$ ProMP in the mixture; $\bm{\tilde{y}}_{t}$ is the ProMP state generated from the transformation of the context variable; $\bm{\tilde{\mu}}_{\bm{w}}^{j}$ and $\bm{\tilde{\Sigma}}_{\bm{w}}^{j}$ are the posterior distribution parameters in weight space computed from Equations 3 and 4; and $\beta_{r}$ are the same as in Equation 16.

We interpret Equation 17 as a measure of how capable the ProMP is of achieving the task when conditioned on the task-relevant pose determined by the context variable. There is little guidance in the literature for how to set $\bm{\Sigma}_{\bm{y}_{t}}^{*}$ and it is typically taken to be a scaled identity matrix [31]. We highlight this key advantage of our choice to learn the GMM: we obtain meaningful values for both the mean and the covariance for use in this conditioning operation.

We note that we are not able to directly compute the probabilities $p(z\mid\bm{\eta})$ for the uncertainty sampling measures (Eqs. 11–13). Thus we use Bayes' theorem and Eq. 17, giving

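$$p(z\mid\bm{\eta})=\frac{p(\bm{\eta}\mid z)\,p(z)}{\sum_{z_{i}}p(\bm{\eta}\mid z_{i})\,p(z_{i})}\tag{18}$$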

where $z_{i}$ ranges over all possible classes. We use a uniform, uninformative prior for $p(z)$ to reflect our assumption that without further knowledge, any ProMP in the mixture might potentially be used to execute a task. More intelligent priors are worth exploring and we leave this for future work.
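With a uniform prior, $p(z)$ cancels and the posterior is the normalized likelihood. The sketch below works with log-likelihoods and subtracts the maximum before exponentiating; this numerical safeguard and the name `class_posterior` are assumptions of the example, not something stated in the text.

```python
import numpy as np

def class_posterior(log_likelihoods):
    """p(z = j | eta) from log p(eta | z = j), assuming a uniform prior p(z)."""
    log_lik = np.asarray(log_likelihoods, dtype=float)
    # The uniform prior cancels in the ratio. Shifting by the maximum keeps
    # the largest term at exp(0) = 1 so the sum cannot underflow to zero.
    shifted = log_lik - log_lik.max()
    unnormalized = np.exp(shifted)
    return unnormalized / unnormalized.sum()

# Log-likelihoods this small would underflow if exponentiated directly.
posterior = class_posterior([-1200.0, -1201.0, -1250.0])
```

Here the first two ProMPs share almost all of the posterior mass, even though each raw likelihood is far below the smallest representable double.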

IV Experimental Setup

We illustrate the qualitative differences of the active learning strategies under consideration using a simple grasping task. The goal is for the robot to be able to pick up a drill placed in an arbitrary planar pose on a table in the robot’s reachable workspace, as illustrated in Figures 1 and 5. We chose this task because it affords an easily discernible comparison of the different methods while providing a non-trivial space to optimize over. In order to maintain consistency in the demonstrations available to each comparison method, we discretized the sampling space into a grid with planar positions in 5 cm intervals and planar orientations in increments of 45 degrees. The result is a total of approximately 700 possible planar poses for selection. We provided one demonstration for each of these samples through kinesthetic teaching of the robot in gravity compensation mode.
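The size of such a grid follows from simple counting. The sketch below assumes a hypothetical 0.50 m by 0.35 m region of the table; the actual workspace extents are not given in the text, and these values were chosen only because they produce a count consistent with "approximately 700".

```python
import itertools
import math

# Hypothetical extents: 11 x-positions and 8 y-positions at 5 cm spacing.
xs = [round(0.05 * i, 2) for i in range(11)]      # 0.00 ... 0.50 m
ys = [round(0.05 * i, 2) for i in range(8)]       # 0.00 ... 0.35 m
yaws = [math.radians(45 * k) for k in range(8)]   # 0, 45, ..., 315 degrees

poses = list(itertools.product(xs, ys, yaws))
num_poses = len(poses)   # 11 * 8 * 8 = 704
```

Eight orientations per position means roughly 88 grid positions are needed to reach about 700 planar poses.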

We provide a qualitative characterization of the three uncertainty sampling methods discussed in Section III-C; namely, Least Confident, Minimum Margin, and Maximum Entropy. We show that each of these measures computed over the ProMP probabilities exhibits undesired behavior in the context of active learning for ProMPs. We then provide a more rigorous quantitative analysis comparing our proposed method of Greatest Mahalanobis Distance to a random-selection strategy. We present results of executing the grasping task using both methods and show that our method provides better task generalization over the space with fewer demonstrations required from the teacher.

We performed our experiments on a KUKA LBR4+ robot arm equipped with a ReFlex TakkTile hand [32, 33]. Data is available at http://bit.ly/al_promp_data, code at http://bit.ly/al_promp_code, and a video at https://youtu.be/na91UyidDvE. Given the Cartesian waypoints generated from a ProMP policy, we formulate a Sequential Quadratic Program to obtain a joint trajectory by minimizing the L2 squared error between the end-effector pose and the Cartesian waypoints [34]. We tracked the resulting trajectory with a real-time Orocos [35] joint space PD controller operated at 500 Hz. Grasps were performed by assuming a canonical preshape and closing the hand until contact was made (as detected by the TakkTile [36] pressure sensors on the ReFlex fingers). We then drove the motors a small additional amount to achieve a firm grasp, following the control approach from [37]. Once grasped, a pre-defined lifting sequence was executed to lift the object approximately 20 cm above the table. A grasp is considered successful if the object is still in the robot’s grasp at the end of the lifting sequence.

Prior to executing any trajectory on the physical robot, we perform a kinematic simulation of the robot with the environment model overlaid in rviz. We do not execute any trajectory that is clearly dangerous in terms of colliding with the environment at a non-trivial velocity.

We use the drill from the YCB dataset [38] as the object to be grasped by the robot, as shown in Figure 1. We track the pose of the object using the Bayesian object tracker described in [30]. The pose is visualized in rviz and overlaid on the camera feed coming from an ASUS X-tion Pro RGB-D camera. Selected task poses for the object are also displayed in this way, and the human user utilizes the displays to align the object pose with the generated task instance pose.

V Experimental Results

V-A Qualitative Comparison of Uncertainty Sampling

We perform a qualitative comparison of the four uncertainty sampling methods described in Section III-C. We analyze the progression of the uncertainty sampling metrics over the grid data space described in Section IV as more demonstrations are acquired. As seen in Figure 3(a), Least Confident, Minimum Margin, and Maximum Entropy each tend to fixate selection on the boundaries between ProMPs. Once at least two neighboring ProMPs become well-estimated enough to produce meaningful probability measures, they begin to “compete” over the territory covered in part by both ProMPs. This behavior is not desirable for the purpose of promoting task generalization over the entire space. As such, we found that while these measures are the go-to objective functions for uncertainty-sampling approaches in supervised active learning [13], they do not provide a suitable mechanism for guiding the creation of a ProMP library that can generalize well over a given space.

We propose the Greatest Mahalanobis Distance, described in Section III-C, as an alternative to these standard measures. As seen in Figure 4, the Mahalanobis distance objective tends to converge to low values instead of becoming heightened on boundaries between ProMPs. Even if a task instance can be achieved by multiple ProMPs (i.e. the instance exists near a boundary between two ProMPs), its minimum Mahalanobis distance is unaffected by such competition. We submit that this behavior makes Greatest Mahalanobis Distance the most suitable measure among the four compared for active learning of ProMPs, as it will tend to drive the learning into regions that have not been explored, instead of fixating on boundaries between already well-estimated regions of the task space.
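A minimal sketch of this selection rule follows. It assumes each ProMP can be summarized by a Gaussian (mean and covariance) over the task-instance space, which simplifies the construction in Section III-C; the library contents, the candidate instances, and the function names are all hypothetical.

```python
import numpy as np


def mahalanobis(x, mean, cov):
    """Mahalanobis distance of x from a Gaussian with the given mean and covariance."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))


def min_distance_to_library(x, library):
    """Distance from x to the closest ProMP; unaffected by how many ProMPs overlap at x."""
    return min(mahalanobis(x, mean, cov) for mean, cov in library)


def select_next_instance(candidates, library):
    """Pick the candidate whose closest ProMP is farthest away."""
    scores = [min_distance_to_library(c, library) for c in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]


# Two hypothetical ProMPs over planar object positions (assumed summaries).
library = [
    (np.array([0.0, 0.0]), 0.04 * np.eye(2)),
    (np.array([1.0, 0.0]), 0.04 * np.eye(2)),
]
# Candidates: a boundary point, a well-covered point, and an unexplored point.
candidates = [(0.5, 0.0), (0.1, 0.1), (2.0, 1.5)]
chosen, score = select_next_instance(candidates, library)
```

In this toy setup the unexplored point is chosen over the boundary point, since the boundary point still lies comparatively close to both ProMPs.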

V-B Task Success on Execution

To demonstrate the efficacy of the Greatest Mahalanobis Distance measure for active learning, we compare our method against random selection of task instances, evaluating both through task executions on the robot as described in Section IV. To account for randomness in the learning process, we perform ten trials of learning ProMP libraries over the space. We then choose the ProMP library that achieved the median performance on a validation metric for testing on the robot.

We use the recorded demonstrations over the discretized space to perform the ten learning trials. In each trial we use a random seed for the random sample generation, and we use the same seed to generate a small set of initial samples to initialize our active learning method. For each trial, we generated task instances and collected 25 demonstrations for each method. We then ranked the capability of the ProMP libraries by the value of the Greatest Mahalanobis Distance computed over all task instances for use as our validation metric.
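Selecting the median-performing library can be sketched as follows. Because ten trials have no single middle value, this sketch takes the lower of the two middle scores, which is an assumption on our part; the scores are made-up placeholders, not experimental results.

```python
import statistics


def median_library(validation_scores):
    """Return the index of the trial whose validation score is the median.

    With an even number of trials, the lower of the two middle scores
    is used (an assumed tie-breaking convention).
    """
    target = statistics.median_low(validation_scores)
    return validation_scores.index(target)


# Made-up validation scores for ten learning trials (placeholders only).
scores = [3.1, 2.4, 5.0, 4.2, 2.9, 3.8, 6.1, 2.2, 4.9, 3.5]
chosen_trial = median_library(scores)
```

Choosing the median rather than the best trial avoids reporting an unusually favorable library for either method.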

We generated a test set of ten random planar object poses to attempt with each comparison method. We emphasize that the test poses were generated from a continuous set, i.e. they are not selected from candidates in the discretized space, and as such they are not likely to be identical to any instances the methods received demonstrations for. The object was tracked and placed on the table by the user to align with the coordinate frame of the generated instance, as described in Section IV. For each method, the most likely ProMP and condition point to produce task success were selected based on the Greatest Mahalanobis Distance measure for that object pose. The resulting ProMP policy was then executed. We used task completion as our metric of success, where the task is considered successfully completed if the object remains in the robot’s grasp after the lifting phase described in Section IV has completed.

Random selection of task instances resulted in only 2 out of 10 successful grasps. Three of the instances were attempted but quickly failed because the robot knocked the object off the table or pushed the object away as the fingers started closing around it. The other five instances could not even be attempted because the execution previews in rviz raised safety concerns. These were primarily cases where the hand was clearly going to collide with the object or table at a high velocity, risking damage to the robot hand. In summary, random selection resulted in 20% success, 30% failure, and 50% infeasible due to safety concerns.

Our Greatest Mahalanobis Distance approach resulted in 6 successful grasps, 3 failed grasps, and only 1 instance that was infeasible due to safety concerns. Only one of the failed grasps missed the object entirely, because the robot pushed the object away as the fingers closed. The other two failures were attempted overhead grasps in which the robot reached a suitable pre-grasp and closed the fingers around the upper portion of the drill, but then dropped the object during the lifting phase. The infeasible instance involved a clear collision between the fingers and the object at a high velocity. To summarize, our method resulted in 60% task success, 30% failure, and only 10% infeasible.

We note that the recorded demonstrations were generally either an overhead grasp towards the head of the drill, or a side grasp radially located about the drill handle. Overhead grasps were more suitable when the object was located closer to the base of the robot, whereas side grasps were more appropriate the further the object was located from the base. However, from the user’s perspective, overhead grasps were significantly more difficult to demonstrate successfully. This is primarily because the weight of the drill requires a precise grasp pre-shape from above to fully enclose the drill head without losing grip during the lifting phase.

V-C Learning Feasible and Infeasible Task Regions

In some situations the boundaries of the context region $\mathcal{C}_{d}$ to generalize over may not be explicitly known a priori. For such cases we propose a minor extension to our approach enabling the robot to learn an explicit infeasible region $\mathcal{R}$ to avoid. We propose modeling this region using a Gaussian mixture model defined on the context space $\mathcal{C}$ .

To formulate this as an active learning problem, we treat the learned mixture of ProMPs as a single positive class, with class probability defined by Equation 5, and use the GMM to represent the negative class. When asked to provide a sample, the user provides a demonstration as before if the sample represents a point in the feasible region; otherwise, the user simply labels the point infeasible and the active learner provides a new sample. In two-class cases, the Least-Confident and Minimum-Margin uncertainty sampling methods are equivalent to Maximum Entropy [13].
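The two-class equivalence can be checked directly. Writing $p$ for the probability of the feasible class at a candidate point (notation introduced only for this check), each measure reduces to a function of $|p - \tfrac{1}{2}|$:

```latex
\begin{align*}
u_{\mathrm{LC}}(p) &= 1 - \max(p,\, 1-p) = \tfrac{1}{2} - \left|p - \tfrac{1}{2}\right|, \\
m(p) &= \max(p,\, 1-p) - \min(p,\, 1-p) = 2\left|p - \tfrac{1}{2}\right|, \\
H(p) &= -p \log p - (1-p)\log(1-p), \qquad H'(p) = \log\frac{1-p}{p}.
\end{align*}
```

Least Confident maximizes $u_{\mathrm{LC}}$ and Minimum Margin minimizes $m$; both amount to minimizing $|p - \tfrac{1}{2}|$. Since $H$ is symmetric about $\tfrac{1}{2}$ and its derivative is positive for $p < \tfrac{1}{2}$ and negative for $p > \tfrac{1}{2}$, maximizing entropy also minimizes $|p - \tfrac{1}{2}|$, so all three measures select the same candidate.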

Figure 3(b) visualizes the maximum entropy associated with a feasible-infeasible learning trial. In the case pictured, an obstacle sits in the center of the table, and the robot should not collide with it. We see that the points of highest entropy (lighter colors) lie near the boundary between this infeasible center region and the surrounding areas, known to be feasible from example demonstrations. Thus the maximum entropy metric proves useful in this scenario, selecting samples to refine the boundary between the neighboring feasible and infeasible regions.
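A one-dimensional toy version of this scenario is sketched below. As a simplifying assumption, it stands in a Gaussian mixture over the context for the ProMP class probability of Equation 5, and uses an equal class prior; the component parameters, the grid, and all function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, components):
    """components: list of (weight, mean, std) tuples."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in components)


def feasible_posterior(x, feasible, infeasible, prior_feasible=0.5):
    """Bayes rule for the two-class case, with an assumed class prior."""
    pos = prior_feasible * mixture_pdf(x, feasible)
    neg = (1.0 - prior_feasible) * mixture_pdf(x, infeasible)
    return pos / (pos + neg)


def binary_entropy(p):
    """Entropy in nats of a two-class distribution (p, 1 - p)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


# Feasible regions on either side of an obstacle centered at 0 (toy values).
feasible = [(0.5, -0.5, 0.15), (0.5, 0.5, 0.15)]
infeasible = [(1.0, 0.0, 0.10)]
grid = [round(-0.6 + 0.05 * i, 2) for i in range(25)]
scores = [binary_entropy(feasible_posterior(x, feasible, infeasible)) for x in grid]
query = grid[scores.index(max(scores))]
```

In this toy setup the selected `query` falls between the obstacle center and a feasible region rather than inside either, mirroring the boundary refinement described above.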

VI Conclusion

We have presented a framework for active learning of a library of Probabilistic Movement Primitives from demonstration. Our method leverages existing active learning techniques while utilizing the information encoded in the ProMPs to compute the active learning measure guiding sample selection. We demonstrated with real-robot experiments that our method provides an advantage over randomly choosing demonstrations over the space in which generalization is desired. Our method provides an uncertainty estimate of task success over a given region, enabling the robot to be deployed to situations where a teacher may not be available, e.g. remote missions in space.

In this paper, we only considered task generalization over a static environment. In future work, we will explore adapting our methods to dynamic environments in which task constraints vary over time, such as obstacles that are not fixed features of the environment. Additional future work could examine incorporating a more informed prior for classifying feasible and infeasible regions that either leverages knowledge of an environmental map or could be learned and transferred from previous tasks.

Bibliography


[1] C. G. Atkeson and S. Schaal, “Robot Learning From Demonstration,” in International Conference on Machine Learning (ICML), 1997, pp. 12–20.
[2] A. Billard, S. Calinon, R. Dillmann, and S. Schaal, “Robot Programming by Demonstration,” in Springer Handbook of Robotics. Springer, 2008, ch. 59, pp. 1371–1394.
[3] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, “A survey of robot learning from demonstration,” Robotics and Autonomous Systems, vol. 57, no. 5, pp. 469–483, 2009.
[4] A. Paraschos, C. Daniel, J. Peters, and G. Neumann, “Using probabilistic movement primitives in robotics,” Autonomous Robots, vol. 42, no. 3, pp. 529–551, Mar. 2018.
[5] A. Paraschos, R. Lioutikov, J. Peters, and G. Neumann, “Probabilistic Prioritization of Movement Primitives,” Robotics and Automation Letters, vol. 2, no. 4, pp. 2294–2301, 2017.
[6] A. Paraschos, E. Rueckert, J. Peters, and G. Neumann, “Probabilistic movement primitives under unknown system dynamics,” Advanced Robotics, vol. 32, no. 6, pp. 297–310, Mar. 2018.
[7] M. Ewerton, G. Neumann, R. Lioutikov, H. Ben Amor, J. Peters, and G. Maeda, “Learning multiple collaborative tasks with a mixture of Interaction Primitives,” in ICRA. IEEE, May 2015, pp. 1535–1542.
[8] S. Gomez-Gonzalez, G. Neumann, B. Scholkopf, and J. Peters, “Using probabilistic movement primitives for striking movements,” in Humanoids. IEEE, Nov. 2016, pp. 502–508.