Learning from Extrapolated Corrections
Jason Y. Zhang, Anca D. Dragan

TL;DR
This paper explores how robots can learn cost functions from user corrections by extrapolating limited information, demonstrating that non-Euclidean function spaces improve learning accuracy and user perception.
Contribution
It introduces a novel approach to extrapolate user corrections using online function approximation with non-Euclidean norms, enhancing robot learning from limited guidance.
Findings
Non-Euclidean norms better capture user intent in uncluttered environments
Using these norms improves the accuracy of learned cost functions
User perception of robot performance is positively affected
Table I: ANOVA results for users’ subjective ratings of their experience with each norm.
| | Statement | F | p-value |
|---|---|---|---|
| 1. | By the end, the robot understood how I wanted it to do the task. | 6.6710 | 0.0170 |
| 2. | The robot’s performance improved over time. | 3.1325 | 0.0906 |
| 3. | I had to keep correcting the robot. | 11.2886 | 0.0028 |
| 4. | It was easy to anticipate how the robot would respond to my corrections. | 25.2018 | 0.0001 |
| 5. | It was easy to physically interact with the robot. | 0.0831 | 0.7759 |
| 6. | I knew what to do to get the robot to perform the task correctly. | 16.0177 | 0.0006 |
Jason Y. Zhang and Anca D. Dragan
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley 94720
{zhang.j,anca}@berkeley.edu
Abstract
Our goal is to enable robots to learn cost functions from user guidance. Often it is difficult or impossible for users to provide full demonstrations, so corrections have emerged as an easier guidance channel. However, when robots learn cost functions from corrections rather than demonstrations, they have to extrapolate a small amount of information (the change of a waypoint along the way) to the rest of the trajectory. We cast this extrapolation problem as online function approximation, which exposes different ways in which the robot can interpret what trajectory the person intended, depending on the function space used for the approximation. Our simulation results and user study suggest that using function spaces with non-Euclidean norms can better capture what users intend, particularly if environments are uncluttered. This, in turn, can lead to the robot learning a more accurate cost function and can improve the user’s subjective perceptions of the robot.
I Introduction
Robots typically generate their motion to optimize some cost function [18, 12, 13, 21]. Specifying good cost functions for robot motion planning is difficult for two reasons. First, tuning cost function parameters to get the desired behavior in a single environment, let alone across a range of test environments, can be challenging, as different criteria that are important can be at odds with each other. Second, the designer who is supposed to specify the cost function might actually not know it: we design robots to help end-users, and how the end-users want the robot to move is up to them. Different people might have different preferences, e.g. how far away the robot needs to stay from them as it moves, how much it should try to stay in the user’s visible space, etc.
Inverse Reinforcement Learning [4, 20] is an excellent alternative to manually specifying the cost function. The designer or the user provides the robot with demonstrated trajectories, and the robot infers the cost function that explains the demonstrations. This has been successful in many domains, including driving and social navigation [15, 7, 19, 10, 9]. It has been applied to manipulation in some settings, but it remains difficult to use as a cost learning tool there: demonstrations are hard to provide because users must coordinate many degrees of freedom over time [5, 6].
As a result, a relatively new line of work focuses on learning from corrections rather than full demonstrations [3]. Corrections leverage the idea that while providing a sequence of configurations over time is challenging, people can provide a single configuration easily. Rather than generating a trajectory from scratch, the person can modify an existing trajectory by taking one of its waypoints and physically changing it to a new configuration.
With the move from a full demonstration to a corrected waypoint, however, comes a big loss in the amount of information the robot can access. It must now infer the entire trajectory from one waypoint. The assumptions we make when performing this extrapolation can affect the quality of the learning.
In other words, the robot has to estimate what trajectory the person might have intended given its current trajectory and the corrected configuration. Prior work performs this estimation by implicitly assuming that only the one corrected waypoint should change, and the rest of the trajectory should stay the same [3]; or proposes particular ways to deform the trajectory based on the correction [1].
Building on work that has explored deforming trajectories based on changing waypoints in contexts outside of learning cost functions [2, 16], we cast the problem of estimating the full trajectory explicitly as a function approximation problem: we have a current estimate (the current trajectory), we receive one new data point (that the corrected timepoint maps to the corrected configuration), and we re-estimate our trajectory online based on this new data point. Different choices of the space of functions we use for approximation (different inner products) map to different assumptions about what the user intended in prior work, with a notable difference between Euclidean [3] and non-Euclidean [1] inner products.
Naturally, when learning cost functions from corrections, we want to understand which choice better matches what users actually intend, and, more importantly, which leads to the most effective learning. We analyze these questions in simulation and in a user study and find that non-Euclidean inner products that correlate trajectory waypoints across time can often lead to higher user ratings when it comes to how closely the correction matches the user’s intention, in particular for uncluttered scenes. Further, we see that the learning performance is higher with such a choice, both subjectively (as perceived by users) and objectively (when measuring how well the robot learned the desired cost function in a controlled task).
Summary of Contributions. Overall, we find that learning cost functions from corrections benefits from explicitly attempting to estimate the trajectory the user might have intended. It is this explicit estimation lens that exposes our choices for how to interpret and extrapolate from corrections, challenging or validating assumptions we’ve made in the past. From the perspective of work that uses non-Euclidean norms to deform trajectories [16, 2], we validate that these are also useful when learning cost functions based on the deformed trajectories. From the perspective of work that learns cost functions [3, 1], we challenge the notion that Euclidean norms are always best [3], and support the choice to sometimes use non-Euclidean deformation [1].
II Learning Cost from Corrections
Problem Statement. We denote a robot trajectory by , which is represented as a sequence of configurations from a start to final time. In any environment, the robot needs to minimize a cost function which we parametrize as a linear combination of features [20, 22, 3]:
[TABLE]
where is a weight vector and featurizes the trajectory.
The robot does not observe – only how an end-user or robot designer might want the robot to move. We assume that the user or designer implicitly knows the correct but cannot directly explicate it (end users cannot write down cost parameters, and even designers have trouble tuning them in a generalizable way). Instead, the user can correct any current robot trajectory to a trajectory such that
[TABLE]
The robot is penalized according to the ground truth and needs to estimate it from the human’s guidance in order to perform well.
This problem can be characterized as acting in a partially observed system in which at every step the robot executes a trajectory and transitions to a new environment. In this formalism, is the hidden state in the system and the user’s corrected trajectories at every step are observations about .
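For concreteness, one way to write the model just described is sketched here. The notation is an assumption on our part: $\xi$ for a trajectory, $\phi$ for the feature map, $\theta$ for the weights, $\theta^*$ for the ground truth weights, and $\xi_c$ for the corrected trajectory. The strict inequality is likewise one plausible reading of "the user corrects the trajectory to a better one".

```latex
C_\theta(\xi) = \theta^\top \phi(\xi),
\qquad
\theta^{*\top} \phi(\xi_c) < \theta^{*\top} \phi(\xi)
```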
Solution. Solutions to this problem tend to separate estimating the true weights from finding the best motion plan [3, 19, 1]. At every step , the robot maintains an estimate of (either or a belief ) and uses it to generate the optimal trajectory :
[TABLE]
in the case of a running estimate, or
[TABLE]
in the case of a belief.
When the human provides a correction, the robot infers a corrected trajectory and needs to update its estimate of the weight vector. Under an observation model where corrected trajectories are exponentially more probable when they have lower true cost and a Gaussian prior over , [1] has shown that the MAP can be approximated as
[TABLE]
for positive . This has the intuitive interpretation of finding a cost function in which the corrected trajectory is better than the original trajectory but not deviating too far from the previous estimate.
Taking the gradient and setting it to 0, the new estimate becomes:
[TABLE]
This is the same update rule used in co-active learning [3] and Online Maximum Margin Planning [19] if corrections were demonstrations.
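The update can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the names `theta_hat`, `phi_corrected`, `phi_planned`, and `alpha` are hypothetical, and the sign convention assumes the robot minimizes cost, so the estimate moves against the feature difference.

```python
import numpy as np

def update_weights(theta_hat, phi_corrected, phi_planned, alpha):
    """One closed-form step on the weight estimate.

    Assumed objective (our reading of the text), minimized over theta:
        theta . (phi_corrected - phi_planned) + ||theta - theta_hat||^2 / (2 * alpha)
    Setting its gradient to zero gives the step returned below.
    """
    theta_hat = np.asarray(theta_hat, dtype=float)
    diff = np.asarray(phi_corrected, dtype=float) - np.asarray(phi_planned, dtype=float)
    return theta_hat - alpha * diff

# Toy numbers: two features, starting from zero weights.
theta = update_weights([0.0, 0.0], phi_corrected=[1.0, 3.0], phi_planned=[2.0, 1.0], alpha=0.5)
```

Under the updated weights, the corrected features score lower cost than the planned ones, which is the behavior the update is meant to produce.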
III “Intended” Corrections
Since the goal of learning from corrections is to address cases where users would have a difficult time demonstrating trajectories, it is important to note that the robot does not actually measure a corrected trajectory directly. Instead, after observing the entire trajectory, users correct one point along the path, from to . Thus, when the robot gets a correction, it does not observe what corrected trajectory the user intends, it just observes one point along that intended trajectory and needs to infer the rest.
III-A State of the art
Jain et al. [3] assume that given a trajectory and a correction , the intended correction at the trajectory level is , i.e. that the user only meant to change a single waypoint along the trajectory.
While this might make sense in the context of RRT [14] trajectories which have a few waypoints that are far apart and require large and almost instantaneous changes in velocity to follow, it is likely not what users have in mind when correcting smooth robot trajectories made out of many waypoints (as is typical of trajectories produced via trajectory optimization [18, 12]). If the original trajectory is smooth, then will jerk when going from to to . Not only is this likely not what the user intends, but it might also fail to substantially improve the original trajectory, especially if smoothness is part of the ground truth cost.
Bajcsy et al. [1] assume that the intended correction is a deformation of the original trajectory, propagating the change at one waypoint down to the rest of the configurations by multiplying through a linear operator.
In what follows, we generalize this assumption. We present a formalism for deriving the corrected trajectory based on prior work in Dynamic Movement Primitives [2] and physical human-robot interaction [16], and show how the assumption above is one instance of an entire family of possible interpretations for what the user intended.
III-B Formalism of intended corrections
The only thing that the robot knows about the intended correction is that it should go through the corrected waypoint , meaning . We treat finding as one step of an online function approximation problem – we are at , have received a data point , and want to minimally update our estimate to incorporate this new data point:
[TABLE]
Here, distance to the original trajectory is measured with respect to some inner product in the Hilbert space of trajectories.
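One way to write this minimal-update problem is sketched below. The symbols are assumptions: $\xi_i$ for the current trajectory, $\xi_c$ for the intended one, $(t, q_c)$ for the corrected timepoint and configuration, $T$ for the final time, and $A$ for the symmetric positive definite matrix that defines the inner product.

```latex
\xi_c = \arg\min_{\xi} \; \tfrac{1}{2} \, \|\xi - \xi_i\|_A^2
\quad \text{s.t.} \quad
\xi(t) = q_c, \;\; \xi(0) = \xi_i(0), \;\; \xi(T) = \xi_i(T),
\qquad \text{where } \|\xi\|_A^2 = \xi^\top A \, \xi
```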
Solution: If the trajectory is discretized by time, then is a matrix, so . The Lagrangian of (2) is
[TABLE]
Set the gradients w.r.t. , , , and to 0:
[TABLE]
Thus, the solution to (2) is:
[TABLE]
where , , and satisfy (3).
This has an intuitive interpretation: the robot uses the inverse of the norm to propagate the correction to the rest of the trajectory, while keeping the start and the goal fixed. This is analogous to using a norm to respond to changes in the goal, as in [2], and similar to work in responding to a force during haptic robot teleoperation [16] (there, the propagation happens not from the current point, but from a future time point, so that the human does not have to keep providing input).
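The propagation described above can be sketched numerically. This is an illustration under stated assumptions, not the authors' code: the function name `extrapolate_correction` and its arguments are hypothetical, the trajectory is a T x d array of waypoints, `A` is a symmetric positive definite T x T metric, and the corrected index is strictly between the two endpoints.

```python
import numpy as np

def extrapolate_correction(xi, t, q_new, A):
    """Minimally deform xi under metric A so that waypoint t equals q_new
    while the first and last waypoints stay fixed."""
    xi = np.asarray(xi, dtype=float)
    T = xi.shape[0]
    rows = [0, t, T - 1]                        # constrained time indices
    C = np.zeros((3, T))
    C[np.arange(3), rows] = 1.0                 # selection matrix for those rows
    target = np.vstack([xi[0], q_new, xi[-1]])  # values the constraints demand
    A_inv_Ct = np.linalg.solve(A, C.T)          # A^{-1} C^T, shape T x 3
    # Multipliers chosen so that C (xi + A^{-1} C^T lam) = target.
    lam = np.linalg.solve(C @ A_inv_Ct, target - C @ xi)
    return xi + A_inv_Ct @ lam

xi = np.linspace([0.0, 0.0], [1.0, 0.0], 5)     # straight line, 5 waypoints in 2-D
# Euclidean metric: only the corrected waypoint moves.
xi_c = extrapolate_correction(xi, t=2, q_new=[0.5, 1.0], A=np.eye(5))
# A smoothness metric: neighbouring waypoints are pulled along.
A_smooth = 2 * np.eye(5) - np.eye(5, k=1) - np.eye(5, k=-1)
xi_smooth = extrapolate_correction(xi, t=2, q_new=[0.5, 1.0], A=A_smooth)
```

Solving with `np.linalg.solve` avoids forming the inverse of the metric explicitly, which is usually the better-conditioned choice.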
III-C Metrics for interpreting corrections
Different norms lead to different types of propagations. Fig. 2 shows how different metrics induce different propagation behaviors.
Identity. Setting to be the identity matrix corresponds to using the Euclidean inner product and leads to the optimal solution as in [3]. However, in this work we hypothesize that different, non-Euclidean inner products perform better at capturing what users intend with their corrections and will lead to faster learning. We present some options below.
Velocities and Higher Order Terms. One way to induce smoothness is to penalize changes in velocity. If A is the finite differencing matrix:
[TABLE]
then computes the sum squared velocities of trajectory . Such a matrix and alternatives for higher order derivatives are popular in trajectory optimizers [18, 12, 17] to produce smooth trajectories, and have also been used in physical human-robot interaction [16], including in the context of learning from corrections [1].
The inverse of the finite differencing matrix also has an intuitive interpretation, linearly propagating corrections over the entire trajectory (See Fig. 2).
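A small numerical check of this linear propagation is sketched below. The construction is an assumption: we use a (T+1) x T differencing matrix, here called `K`, that treats the trajectory as zero-padded at both ends so that the metric `K.T @ K` is invertible. The function name `velocity_metric` is hypothetical.

```python
import numpy as np

def velocity_metric(T):
    """Return A = K^T K for a (T+1) x T finite differencing matrix K
    (assumed zero-padded at both ends)."""
    K = np.zeros((T + 1, T))
    idx = np.arange(T)
    K[idx, idx] = 1.0        # +xi[i] in row i
    K[idx + 1, idx] = -1.0   # -xi[i] in row i + 1
    return K.T @ K           # tridiagonal: 2 on the diagonal, -1 beside it

A = velocity_metric(5)
# Column 2 of the inverse shows how a unit push at waypoint 2 spreads.
column = np.linalg.inv(A)[:, 2]   # tent-shaped: [0.5, 1.0, 1.5, 1.0, 0.5]
```

The column rises and falls linearly around the pushed waypoint, which matches the description of corrections being propagated linearly over the trajectory.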
Gaussian (RBF) Kernel. An alternative is setting to the RBF kernel:
[TABLE]
This provides an additional hyperparameter, , that allows us to tune how local or global the desired propagation is.
Note that when we use an RBF kernel, our estimation update of is analogous to function approximation in Online Kernel Machines [11]. There, a new data point adds a new term to the function which propagates the change via the kernel. This is equivalent to Equation 3, modulo our end point constraints and , and the fact that we impose as a hard constraint.
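The effect of the kernel width can be illustrated with a short sketch. The parametrization below (squared distance between waypoint indices, divided by twice the squared bandwidth) is one common form of the RBF kernel and is an assumption here, as are the names `rbf_kernel_matrix` and `sigma`. Each column shows how a unit change at one waypoint would spread if the kernel matrix is used to propagate it.

```python
import numpy as np

def rbf_kernel_matrix(T, sigma):
    """K[i, j] = exp(-(i - j)^2 / (2 sigma^2)) over waypoint indices 0..T-1."""
    idx = np.arange(T, dtype=float)
    sq_dist = (idx[:, None] - idx[None, :]) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# Spread of a change at the middle waypoint for a narrow and a wide kernel.
narrow = rbf_kernel_matrix(11, sigma=0.5)[:, 5]
wide = rbf_kernel_matrix(11, sigma=3.0)[:, 5]
```

A small `sigma` keeps the change close to the corrected waypoint, while a large one moves most of the trajectory.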
III-D Overall Algorithm
Putting it all together, we present the following algorithm (Alg. 1). First, we initialize the weight vector to zero. Each iteration, a trajectory optimizer [12] plans the optimal trajectory with current weight vector . We present the trajectory to the user, who provides a correction consisting of a timepoint and a joint configuration. We extrapolate to a full trajectory by treating the feedback as an online function approximation problem subject to a predefined inner product norm. Finally, we update our weights in the direction of the difference in the features of the trajectory and those of the original planned trajectory , weighted by the learning rate . Fig. 3 depicts one iteration of the algorithm.
Note that as long as the user’s corrections produce trajectories with lower cost in expectation, the expected regret of our algorithm has an upper bound of after iterations [22].
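The loop structure can be sketched as follows. Everything here is a hypothetical stand-in: the planner picks from three candidate trajectories instead of calling a trajectory optimizer, the "user" is a hard-coded rule, and the extrapolation uses the Euclidean reading in which only the corrected waypoint moves.

```python
import numpy as np

def learn_from_corrections(plan, featurize, get_correction, extrapolate,
                           n_features, alpha, max_iters):
    theta = np.zeros(n_features)              # start from zero weights
    for _ in range(max_iters):
        xi = plan(theta)                      # plan under the current estimate
        correction = get_correction(xi)       # (timepoint, configuration) or None
        if correction is None:                # user is satisfied
            break
        t, q = correction
        xi_c = extrapolate(xi, t, q)          # infer the intended full trajectory
        theta = theta - alpha * (featurize(xi_c) - featurize(xi))
    return theta

# Toy stand-ins: 3-waypoint 1-D trajectories with a single feature.
candidates = [np.array([0.0, h, 0.0]) for h in (1.0, 0.5, 0.0)]

def featurize(xi):
    return np.array([xi[1]])                  # height of the middle waypoint

def plan(theta):
    return min(candidates, key=lambda xi: float(theta @ featurize(xi)))

def get_correction(xi):
    return None if xi[1] == 0.0 else (1, 0.0)  # the user wants the middle low

def extrapolate(xi, t, q):
    xi_c = xi.copy()
    xi_c[t] = q                               # move only the corrected waypoint
    return xi_c

theta = learn_from_corrections(plan, featurize, get_correction, extrapolate,
                               n_features=1, alpha=0.5, max_iters=5)
```

In this toy setting a single correction is enough: the weight on height becomes positive, so the next plan is the low trajectory and the loop stops.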
IV Results in Simulation
We first provide analysis in simulation to see how the choice of norm impacts performance and whether non-Euclidean norms might be more effective in certain environments.
Environments. We simulated a set of environments with objects of varying types. We varied the number of such types (features), as well as the number of instances for each type. Fig. 5 shows an example environment, and Fig. 4 shows the effect of each norm on the trajectory. For each environment, we assigned a random ground truth objective function (i.e. weight vector).
Simulating user input. For each environment, we use the ground truth weights to plan the ground truth trajectory . We then iterate by planning a trajectory for the current weights (starting with ), selecting the timepoint at which the optimal trajectory and the planned trajectory differ the most (calculated using ), and correcting that waypoint to its value in the ground truth trajectory.
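The simulated user can be sketched as a short function. The distance used to compare waypoints is an assumption (Euclidean distance in configuration space), and the name `simulate_correction` is hypothetical.

```python
import numpy as np

def simulate_correction(planned, ground_truth):
    """Return (t, q): the waypoint index with the largest gap between the two
    trajectories, and the ground-truth configuration at that index."""
    planned = np.asarray(planned, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    gaps = np.linalg.norm(planned - ground_truth, axis=1)  # assumed distance measure
    t = int(np.argmax(gaps))
    return t, ground_truth[t]

planned = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
ground_truth = [[0.0, 0.0], [1.0, 2.0], [2.0, 0.5]]
t, q = simulate_correction(planned, ground_truth)
```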
Implementation details. We used Trajopt [12] as our trajectory optimizer and simulated the environments in OpenRAVE [8]. For each (number of types, number of instances) pair, we generated 25 different environments. To make sure that cost values in different environments were comparable, we normalized costs such that the optimal trajectory had a cost of 0 and the initial trajectory (straight-line in configuration space) had a cost of 1.
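One normalization with these two properties is an affine rescaling, sketched below. The source does not state the exact formula, so this is an assumption, and the function name is hypothetical.

```python
def normalize_cost(cost, optimal_cost, initial_cost):
    """Rescale so the optimal trajectory maps to 0 and the initial one to 1."""
    span = initial_cost - optimal_cost
    if span == 0:
        raise ValueError("initial and optimal costs must differ")
    return (cost - optimal_cost) / span

halfway = normalize_cost(3.0, optimal_cost=1.0, initial_cost=5.0)
```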
In addition, we tuned the learning rate for each norm individually. Since wider kernels tend to update the weight vector more dramatically, we found that the optimal learning rates for such kernels were lower than those of narrower kernels.
Analysis. Fig. 6 shows the results of the simulations. Overall, we found that in simple environments with few object types and instances, wider propagations (i.e. wider kernels for ) result in lower cost over time. However, as environments become increasingly complex, norms that produce narrower propagations become more effective as the user has more fine-tuned control over the correction. In environments with many object types, the advantage of wider propagations is less noticeable. In cluttered environments, wider propagations are more likely to cause the algorithm to unintentionally infer the wrong updates to other features (see Fig. 3). Overall, the velocities norm compares somewhat favorably to the Euclidean distance, with the exception of environments with both high clutter and many feature types.
V User Study
Our simulations revealed that while there are situations where Euclidean corrections work well, there are also many cases where non-Euclidean is preferable. We designed and conducted a user study to test this result with real users.
V-A Experimental Design
Task: We instructed users to teach a JACO2 7-degree-of-freedom arm to plan trajectories that balance three different properties:
- keep the cup close to the table,
- keep the cup over the table, and
- keep the cup away from the laptop.
For each iteration, the user provided a correction to the trajectory by selecting one of the waypoints verbally and then physically correcting that waypoint in gravity-compensation mode.
Independent Variables: We manipulated 3 independent variables:
- Norm for Interpreting Corrections: We tested our algorithm with the Euclidean Norm and the Velocities Norm. The user could provide up to five corrections with each method, stopping if the user felt that the planned trajectory looked exactly like the optimal trajectory.
- Location Strategy: We tested two strategies to come up with user corrections. Users have to decide on the time point to give the correction at, so in one condition, we provided no instructions for selecting the correction and let them choose what intuitively made sense to them (Anywhere); whereas in another condition, we instructed them to choose the time point at which the optimal trajectory and the planned trajectory differed the most (Largest).
- Environment: We designed two environments to make sure our results were not specific to a particular setting. We chose the environments such that one might benefit from wider propagations, whereas another might require more local corrections because correcting the different features (distances to table, laptop) might come in conflict. Fig. 3 shows an example of how global propagations can lead to counter-productive updates.
Dependent Variables: We designed measures that can capture whether people intend non-Euclidean corrected trajectories, both objectively and subjectively.
Since we could not directly access each user’s internal preferences, we defined a set of optimal weights and showed the optimal trajectory planned using those weights to the users to ground their preferences and make sure they understood the task. Thus, participants tried to recreate the optimal behavior. This enabled us to measure the robot’s performance based on an objective measure of cost computed from the optimal weights. If the robot produces trajectories closer to the intended corrections, then those trajectories should have lower cost.
At the end of each iteration, users rated how closely the corrected trajectory matched what they had in mind. This enabled us to measure subjectively how good each norm is at producing the intended correction. We had participants rate the planned trajectory as well to evaluate how well they thought the robot was learning from the correction. Once they were finished giving corrections using each method, the participants rated how well the robot understood their corrections and the ease of teaching the robot.
Hypotheses:
- H1: The non-Euclidean norm produces trajectories with lower overall cost (by the end of the learning and also along the way).
- H2: The non-Euclidean norm leads to corrected trajectories that better match what the user intends, as evaluated via self-reports.
- H3: The user perceives corrections with the non-Euclidean norm as more successful at learning and easier to use for teaching.
Participants: We recruited 26 participants (11M, 15F) from UC Berkeley students. 15 participants reported having a technical background. The norm factor was within-subjects: participants provided corrections interpreted using both norms to provide a calibrated comparison. The location strategy and environment factors were between-subjects: we assigned one environment and one location strategy to each participant. We presented the methods in counterbalanced order.
V-B Analysis
Objective: We conducted a factorial repeated measures ANOVA with environment type (1 or 2), location strategy (anywhere or largest), and norm type (Euclidean or Velocities) as factors, on the cost. We used time (iteration 1 through 5) as a factor as well. We found that environment (), location (), time (), and norm () have statistically significant effects on cost (See Fig. 8).
As expected, cost decreased over time. Surprisingly, the anywhere location strategy was significantly better, suggesting that end-users are intuitively able to choose good points to intervene with a correction.
There was an interaction effect only for environment and norm (). A post-hoc analysis with Tukey HSD showed that the non-Euclidean norm led to significantly lower cost in environment 2, and lower but only marginally significant in environment 1. Overall, our findings support H1.
Subjective: We conducted a factorial repeated measures ANOVA with environment type, location of correction, and norm type as factors on the average rating for corrected and planned trajectories. We found that environment (), location (), and norm () have a statistically significant effect on corrected trajectory ratings. No interaction effects were statistically significant. We found that environment 2 was harder for users. Surprisingly, they perceived the anywhere strategy as less effective, even though objectively it performed better. But most importantly, in line with H2, they perceived the non-Euclidean corrections to better match their intended corrections (See Fig. 7).
Only norm () had a statistically significant effect on planned trajectory ratings, in the direction we hypothesized (H3): non-Euclidean corrections led to better planned trajectories.
Finally, we also ran an ANOVA on subjective ratings for the users’ experience with each norm. Table I summarizes the results. These were also in support of H3.
VI Discussion
Summary. When receiving a correction, the robot does not observe the entire intended trajectory, instead receiving only a single data point. When we explicitly account for that lack of knowledge, we are faced with an online function approximation problem. Solving it in a non-Euclidean inner product space can lead to better learning in some environments than when making a default assumption about what the user intended, either in lower cost or in fewer required interventions from the user.
Limitations and Future Work. The biggest limitation of our method is that it still needs to commit to an inner product or norm in order to interpret the corrections, and while we’ve found one that worked well for many of the tasks we tested, different tasks might benefit from different norms (See Fig. 3). Future work should investigate ways of learning the desired norm interactively from the user.
One limitation in the user study is that we present users with an optimal trajectory, thus trading the external validity of having real preferences for the benefit of having a more objective measure of cost. Future work should complement our study with one that seeks users’ internal preferences and only evaluates the learning subjectively.
Further, there could be ways of learning from corrections that do not require the intermediate step of inferring a trajectory. While these might be expensive now (e.g. reasoning about the Q-value of a corrected action rather than the cumulative cost of an entire trajectory), approximation methods for them might entirely bypass the need for an intended trajectory.
Acknowledgements
This research was supported by funding from Open Philanthropy, AFOSR, and an NSF Career Award. We thank Andrea Bajcsy for insightful discussion and sharing code. We would also like to thank all members of the InterACT lab for helpful feedback.
References
- [1] A. Bajcsy, D. Losey, M. O’Malley, and A. Dragan. Learning robot objectives from physical human interaction. In Conference on Robot Learning, pages 217–226, 2017.
- [2] A. Dragan, K. Muelling, J. Bagnell, and S. Srinivasa. Movement primitives via optimization. In International Conference on Robotics and Automation (ICRA), Pittsburgh, PA, May 2015.
- [3] A. Jain, T. Joachims, and A. Saxena. Learning trajectory preferences for manipulators via iterative improvement. CoRR, abs/1306.6294, 2013.
- [4] P. Abbeel and A. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM, 2004.
- [5] B. Akgun, M. Cakmak, J. Yoo, and A. Thomaz. Trajectories and keyframes for kinesthetic teaching: A human-robot interaction perspective. In Proceedings of the seventh annual ACM/IEEE international conference on Human-Robot Interaction, pages 391–398. ACM, 2012.
- [6] B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.
- [7] B. Ziebart, N. Ratliff, G. Gallagher, C. Mertz, K. Peterson, J. Bagnell, M. Hebert, A. Dey, and S. Srinivasa. Planning-based prediction for pedestrians. In Intelligent Robots and Systems, 2009. IROS 2009. IEEE/RSJ International Conference on, pages 3931–3936. IEEE, 2009.
- [8] R. Diankov. Automated Construction of Robotic Manipulation Programs. PhD thesis, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, September 2010.
