Enabling Robots to Infer how End-Users Teach and Learn through Human-Robot Interaction

Dylan P. Losey; Marcia K. O'Malley

arXiv:1902.00646·cs.RO·February 5, 2019

TL;DR

This paper proposes a Bayesian inference approach for robots to personalize understanding of human interaction strategies during HRI, improving learning and teaching by adapting to individual user behaviors.

Contribution

It introduces a method for robots to infer and adapt to individual human interaction strategies using Bayesian inference, moving beyond fixed strategy assumptions.

Findings

01

Personalized approach outperforms fixed strategy methods in simulations.

02

Robust inference of human strategies improves robot learning and teaching.

03

Bayesian framework effectively models diverse human interaction behaviors.

Abstract

During human-robot interaction (HRI), we want the robot to understand us, and we want to intuitively understand the robot. In order to communicate with and understand the robot, we can leverage interactions, where the human and robot observe each other's behavior. However, it is not always clear how the human and robot should interpret these actions: a given interaction might mean several different things. Within today's state-of-the-art, the robot assigns a single interaction strategy to the human, and learns from or teaches the human according to this fixed strategy. Instead, we here recognize that different users interact in different ways, and so one size does not fit all. Therefore, we argue that the robot should maintain a distribution over the possible human interaction strategies, and then infer how each individual end-user interacts during the task. We formally define learning…

Equations

$$\pi(u \mid x, \theta, \phi)$$

$$\pi(a \mid x, \theta, \psi)$$

$$b^{t+1}(\theta) = P(\theta \mid u^{0:t}, x^{0:t})$$

$$b^{t+1}(\theta) = \frac{b^{t}(\theta) \cdot P(u^{t} \mid x^{t}; \theta)}{\int_{\Theta} b^{t}(\xi) \cdot P(u^{t} \mid x^{t}; \xi)\, d\xi}$$

$$b^{t+1}(\theta) \propto b^{t}(\theta) \cdot P(u^{t} \mid x^{t}; \theta)$$

$$P(u^{t} \mid x^{t}; \theta) = \pi(u^{t} \mid x^{t}; \theta, \phi^{0})$$

$$P(u^{t} \mid x^{t}; \theta) = \int_{\Phi} \pi(u^{t} \mid x^{t}; \theta, \phi) \cdot b^{0}(\phi)\, d\phi$$

$$b^{t+1}(\theta, \phi) = P(\theta, \phi \mid u^{0:t}, x^{0:t})$$

$$b^{t+1}(\theta, \phi) \propto b^{t}(\theta, \phi) \cdot P(u^{t} \mid x^{t}; \theta, \phi)$$

$$b^{t+1}(\theta, \phi) \propto b^{t}(\theta, \phi) \cdot \pi(u^{t} \mid x^{t}; \theta, \phi)$$

$$P(u^{t} \mid x^{t}; \theta) = \int_{\Phi} \pi(u^{t} \mid x^{t}; \theta, \phi) \cdot b^{t}(\phi \mid \theta)\, d\phi$$

$$b^{t}(\phi \mid \theta) = P(\phi \mid u^{0:t-1}, x^{0:t-1}; \theta)$$

$$b^{t+1}(\theta) = \frac{b^{t}(\theta) \cdot \pi(a^{t} \mid x^{t}; \theta; \psi^{*})}{Z(\psi^{*})}$$

$$Z(\psi) = \int_{\Theta} b^{t}(\xi) \cdot \pi(a^{t} \mid x^{t}; \xi; \psi)\, d\xi$$

$$a^{t} = \arg\max_{a} b^{t+1}(\theta^{*})$$

$$u^{t} = h(b^{t}) = b^{t}$$

$$\hat{b}^{t+1}(\theta) = P(\theta \mid u^{0:t}, a^{0:t}, x^{0:t})$$

$$\hat{b}^{t+1}(\theta) = \int_{\Psi} \frac{u^{t}(\theta) \cdot \pi(a^{t} \mid x^{t}; \theta, \psi)}{Z(\psi)} \cdot b^{t}(\psi)\, d\psi$$

$$\hat{b}^{t+1}(\theta) = \frac{u^{t}(\theta) \cdot \pi(a^{t} \mid x^{t}; \theta, \psi^{0})}{Z(\psi^{0})}$$

$$\hat{b}^{t+1}(\theta) = \int_{\Psi} \frac{u^{t}(\theta) \cdot \pi(a^{t} \mid x^{t}; \theta, \psi)}{Z(\psi)} \cdot b^{0}(\psi)\, d\psi$$

$$b^{t}(\psi) = P(\psi \mid u^{0:t}, a^{0:t}, x^{0:t})$$

$$b^{t}(\psi) \propto P(u^{0:t} \mid a^{0:t}, x^{0:t}; \psi) \cdot P(\psi \mid a^{0:t}, x^{0:t})$$

$$b^{t}(\psi) \propto b^{t-1}(\psi) \cdot P\bigg[u^{t} \,\bigg|\, \frac{u^{t-1} \cdot \pi(a^{t-1} \mid x^{t-1}; \theta, \psi)}{Z(\psi)}\bigg]$$

$$a^{t} = \arg\max_{a} \big\{ b^{t+1}(\theta^{*}) - \lambda \cdot H(b^{t+1}(\psi)) \big\}$$

$$\pi \propto \exp\Big\{\alpha\Big[Q(x,u,\theta^{*}) + \phi^{*}\big(R(x^{\prime},\theta^{*}) - R(x,\theta^{*})\big)\Big]\Big\}$$

Full text

Enabling Robots to Infer how End-Users Teach and Learn through Human-Robot Interaction

Dylan P. Losey, Student Member, IEEE, and Marcia K. O’Malley, Senior Member, IEEE

This work was funded in part by the NSF GRFP-1450681.

The authors are with the Mechatronics and Haptic Interfaces Laboratory, Department of Mechanical Engineering, Rice University, Houston, TX 77005. (e-mail: [email protected])

Abstract

During human-robot interaction (HRI), we want the robot to understand us, and we want to intuitively understand the robot. In order to communicate with and understand the robot, we can leverage interactions, where the human and robot observe each other’s behavior. However, it is not always clear how the human and robot should interpret these actions: a given interaction might mean several different things. Within today’s state-of-the-art, the robot assigns a single interaction strategy to the human, and learns from or teaches the human according to this fixed strategy. Instead, we here recognize that different users interact in different ways, and so one size does not fit all. Therefore, we argue that the robot should maintain a distribution over the possible human interaction strategies, and then infer how each individual end-user interacts during the task. We formally define learning and teaching when the robot is uncertain about the human’s interaction strategy, and derive solutions to both problems using Bayesian inference. In examples and a benchmark simulation, we show that our personalized approach outperforms standard methods that maintain a fixed interaction strategy.

Index Terms:

Cognitive Human-Robot Interaction; Learning from Demonstration; Human Factors and Human-in-the-Loop

I Introduction

Human-robot interaction (HRI) provides an opportunity for the human and robot to exchange information. The robot can learn from the human by observing their behavior [1], or teach the human through its own actions [2]. In applications such as autonomous cars, personal robots, and collaborative assembly, fluent human-robot communication is often necessary.

In order to learn from and teach with interactions, however, the human and robot must correctly interpret the meaning of each other’s behavior. Consider an autonomous car following behind a human driven car. If the human car slows down, what should the robotic car infer: is the human teaching the robot to also slow down, or signaling that the robot should pass? When learning from an end-user, the robot needs a model of that end-user’s teaching strategy, i.e., how the human’s actions relate to the information that human wants to convey. Conversely, when teaching the end-user, the robot must model that end-user’s learning strategy, i.e., how the human interprets the robot’s actions. Together, we define the end-user’s teaching and learning strategies as their interaction strategy.

In the state-of-the-art, the robot assigns a pre-programmed, fixed interaction strategy to every human; each individual end-user is assumed to teach or learn in the same way. Instead:

We here recognize that different users have different interaction strategies, and we should infer the current end-user’s interaction strategy based on their actions.

Rather than a single fixed estimate of the human’s interaction strategy, we argue that the robot should maintain a distribution (i.e., belief) over the possible human interaction strategies, and update this belief during the task. By reasoning over this belief, the robot can adapt to everyday end-users, instead of requiring each human to comply with its single pre-defined strategy.

Overall, we make the following contributions:

Learning and Teaching with Strategy Uncertainty. We introduce and formulate two novel problems in human-robot interaction, where the robot must optimally communicate with the human, but the robot is unsure about how the current end-user teaches or learns.

Solution with Bayesian Inference. We derive methods for the robot to learn and teach under strategy uncertainty. We show that—when the robot does not know the end-user’s interaction strategy—optimal solutions infer and update a belief over that interaction strategy, resulting in personalized interactions.

Simulated Comparison to Current Methods. Using didactic examples and an inverse reinforcement learning simulation, we compare our proposed approach to robots that reason over a fixed interaction strategy. We also consider practical challenges such as noisy and unmodeled interaction strategies.

II Related Work

II-A Robots Learning from Humans

When a human expert is using interactions to teach a robot, the robot can leverage learning from demonstration (LfD) to understand how it should behave [1]. Most similar to our setting is inverse reinforcement learning (IRL), an instance of LfD where the robot learns the correct reward function from human demonstrations [3, 4]. Prior works on IRL generally assume that every human has a single, fixed teaching strategy [5]: the human teaches by providing optimal demonstrations, and any sub-optimal human behavior is interpreted as noise [6, 7, 8, 9]. Alternatively, robots can also learn about the human while learning from that human. In Nikolaidis et al. [10], for instance, the robot learns about the end-user’s adaptability in addition to their reward function. Building on these works, we will infer the end-user’s teaching strategy, so that the robot can more accurately learn from human interactions.

II-B Robots Teaching Humans

Machine teaching—also known as algorithmic teaching—identifies the best way for an expert robot to teach the novice human [2]. In order to teach optimally, however, the robot must know how the human learns. Recent machine teaching works [11, 12, 13] have addressed this problem by using human feedback to resolve mismatches between the assumed human learning strategy and the user’s actual learning strategy. Most related to our research is work by Huang et al. [14], which compares the performance of different models of human learning. These authors generated the optimal teaching examples for each proposed learning strategy, and then used human feedback to identify the single best teaching strategy across all users. Like Huang et al. [14], we here reason over multiple learning strategies, but now we want to infer each user’s specific learning strategy based on their individual responses.

III Problem Statement

Consider a human who is interacting with a robot. In this setting, both the human and the robot are agents. Let us assume that one of these agents has a target model, $\theta^{*}\in\Theta$, which they want to teach to the other agent. Here $\Theta$ is the space of possible target models, and $\theta^{*}$ is the particular behavior that the teacher wants to convey to the learner. For example, the teacher may want to show the learner a better way to complete the current task, or communicate how it will interact in the future. We are interested in how the robot should behave when it is (a) learning $\theta^{*}$ from or (b) teaching $\theta^{*}$ to a human agent.

III-A Notation

Let us denote the robot state as $x$. The human takes action $u$, and the robot takes action $a$; these actions and the state $x$ are observed by both the robot and the human. We use a superscript $t$ to denote the current timestep, so that $x^{t}$ is the state at time $t$, and $x^{0:t}$ is the sequence of states from the start of the task to the current time $t$. In the context of supervised learning, we can think of $x$ as the input features, and $u$ and $a$ as the output labels assigned by the human and robot, respectively [15]. Here $\Theta$ is a hypothesis space, and $\theta^{*}$ defines the correct mapping from features to labels.

III-B Learning from the Human

When the human is the expert—i.e., the human knows $\theta^{*}$ , but the robot does not—the robot should learn from the human. The human wants to teach the robot $\theta^{*}$ , and has a teaching strategy $\phi^{*}$ , which determines what actions the human selects to convey $\theta^{*}$ to the robot. More formally, a teaching strategy $\phi\in\Phi$ relates the setting $(x,\theta)$ to the human action $u$ :

$$\pi(u \mid x, \theta, \phi) \tag{1}$$

Here $\pi\in[0,1]$ is the probability that the human will take action $u$ given $x$ , $\theta$ , and $\phi$ . We point out that (1) is also the human’s policy when teaching the robot, and that this policy is parameterized by $\phi$ . In other words, if the robot knows the teaching strategy $\phi^{*}$ , then it can leverage (1) to correctly interpret the meaning behind the human’s actions.

In practice, however, the robot does not know what teaching strategy an end-user will employ. Hence, we argue that the robot should maintain a probability distribution over $\phi$ as it learns from the human. We refer to this problem of learning from the end-user when uncertain about their teaching strategy as learning with strategy uncertainty:

Definition 1

(Learning with Strategy Uncertainty). Given a discrete or continuous set of possible teaching strategies $\Phi$ and target models $\Theta$ , infer an optimal estimate of $\theta^{*}$ based on the history of states $x^{0:t}$ and human actions $u^{0:t}$ .

III-C Teaching the Human

Next we consider the opposite situation, where the robot is the expert, and is trying to teach $\theta^{*}$ to the human. Here the human agent has some learning strategy $\psi^{*}$ , which determines how the human interprets the robot’s actions $a$ . A learning strategy $\psi\in\Psi$ expresses the relationship (from the human’s perspective) between the setting $(x,\theta)$ and the robot action $a$ :

$$\pi(a \mid x, \theta, \psi) \tag{2}$$

In the above, $\pi$ is the human’s model of the robot’s policy—not necessarily the robot’s actual policy—and this model is parameterized by $\psi$ . So now, if the robot knows the user’s true learning strategy $\psi^{*}$ , the robot can leverage (2) to anticipate how its actions will alter the human’s understanding of $\theta^{*}$ .

But, when teaching an actual end-user, the robot does not know what learning strategy that specific user has. Similar to before, we therefore argue that the robot should maintain a distribution over the learning strategies $\psi$ when teaching the human. We refer to this problem, where the robot is teaching a user but is unsure about that end-user’s learning strategy, as teaching with strategy uncertainty:

Definition 2

(Teaching with Strategy Uncertainty). Given a discrete or continuous set of possible learning strategies $\Psi$ and target models $\Theta$ , select the robot action $a^{t}$ that optimally teaches $\theta^{*}$ based on the history of states $x^{0:t}$ , robot actions $a^{0:t-1}$ , and human actions $u^{0:t}$ .

III-D Assumptions

Throughout this work, we will assume that the interaction strategies $\phi^{*}$ and $\psi^{*}$ for each individual user are constant, and are not affected by the robot’s behavior. Put another way, the robot cannot influence the human’s interaction strategy by selecting different actions. This assumption is consistent with prior HRI research [2, 5]; however, our proposed approach can also be extended to cases where the human’s interaction strategy does change, by incorporating a forgetting factor or transition model within the Bayesian inference.

IV Robot Learning with Strategy Uncertainty

Within this section we focus on learning from the human, where the robot does not initially know the human’s teaching strategy $\phi^{*}$ . Learning here is challenging, because the robot is uncertain about how to interpret the human’s actions. First, we demonstrate how the robot can learn from multiple models of the human’s teaching strategy. Second, we enable the robot to update its joint distribution over $\phi$ and $\theta$ , and simultaneously learn both the human’s teaching strategy and target model. We provide an example which compares learning this joint distribution to learning with a single fixed estimate of $\phi^{*}$ .

IV-A Multiple Teaching Strategies

The robot starts with a prior $b^{0}(\theta)$ over what $\theta^{*}$ is, and updates that belief at every timestep $t$ based on the observed states and actions. The robot’s belief over target models is:

$$b^{t+1}(\theta) = P(\theta \mid u^{0:t}, x^{0:t}) \tag{3}$$

In other words, $b^{t+1}(\theta)$ is the probability that $\theta=\theta^{*}$ given the history of observed states and human actions up to timestep $t$ . Applying Bayes’ rule, and recalling from (1) that the human’s actions $u$ are conditionally independent, the robot’s Bayesian belief update becomes [16]:

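$$b^{t+1}(\theta) = \frac{b^{t}(\theta) \cdot P(u^{t} \mid x^{t}; \theta)}{\int_{\Theta} b^{t}(\xi) \cdot P(u^{t} \mid x^{t}; \xi)\, d\xi} \tag{4}$$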

We here used a semicolon to separate the observed variables from the hidden variables. The denominator—which integrates over all possible target models—is a normalizing constant. Omitting this constant, we can more succinctly write (4) as:

$$b^{t+1}(\theta) \propto b^{t}(\theta) \cdot P(u^{t} \mid x^{t}; \theta) \tag{5}$$

where $P(u \mid x;\theta)$ is the robot’s observation model, i.e., the likelihood that the human takes action $u$ given $x$ and $\theta$.

To correctly learn from the end-user, the robot needs an accurate observation model. We saw in Section III-B that the most accurate observation model is the user’s policy $\pi$ , which is parameterized by the true teaching strategy $\phi^{*}$ . Within the state-of-the-art, the robot often assumes that the user’s policy is parameterized by $\phi^{0}$ , where $\phi^{0}$ is some estimate of $\phi^{*}\,$ :

$$P(u^{t} \mid x^{t}; \theta) = \pi(u^{t} \mid x^{t}; \theta, \phi^{0}) \tag{6}$$

Rather than a constant point estimate of the human’s teaching strategy, we argue that the robot should maintain a belief over multiple teaching strategies. In the simplest case, the robot has a prior $b^{0}(\phi)$ over what $\phi^{*}$ is, but does not update this belief between timesteps. Here the observation model becomes:

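$$P(u^{t} \mid x^{t}; \theta) = \int_{\Phi} \pi(u^{t} \mid x^{t}; \theta, \phi) \cdot b^{0}(\phi)\, d\phi \tag{7}$$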

Note that (6) is a special case of (7) where $b^{0}(\phi^{0}\,)=1$ . When learning with (7), the robot does not interpret human actions in the context of just one teaching strategy. Instead, the robot considers what the action $u$ implies for each possible teaching strategy, and then learns across these strategies. We can think of (7) as the best fixed learning strategy when $b^{0}$ is known.
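The difference between the observation models (6) and (7) can be illustrated with a minimal discrete sketch. The array layout, the function names, and the toy probabilities below are assumptions made for illustration and do not come from the paper; the integrals over $\Theta$ and $\Phi$ are replaced by sums over small finite sets, and a single fixed state $x$ is assumed.

```python
import numpy as np

def observation_model_prior(pi, b0_phi):
    """Eq. (7): marginalize the human policy over a fixed prior on strategies.

    pi[k, i, j] = pi(u_j | x; theta_i, phi_k) for one fixed state x (assumed layout).
    b0_phi[k]   = prior probability of teaching strategy phi_k.
    Returns P(u | x; theta) as an array of shape (n_theta, n_u).
    """
    return np.einsum("k,kij->ij", b0_phi, pi)

def update_belief(b_theta, likelihood, u):
    """Eq. (5): b^{t+1}(theta) is proportional to b^t(theta) * P(u^t | x^t; theta)."""
    posterior = b_theta * likelihood[:, u]
    return posterior / posterior.sum()

# Hypothetical toy numbers: 2 strategies, 2 target models, 2 human actions.
pi = np.array([
    [[0.9, 0.1], [0.1, 0.9]],  # phi_1: actions are very informative about theta
    [[0.6, 0.4], [0.4, 0.6]],  # phi_2: actions are only weakly informative
])
b_theta = np.array([0.5, 0.5])

# Eq. (6) is recovered when the prior puts all of its mass on one strategy.
b_fixed = update_belief(b_theta, observation_model_prior(pi, np.array([1.0, 0.0])), u=0)
# Eq. (7) with an even prior over both strategies.
b_prior = update_belief(b_theta, observation_model_prior(pi, np.array([0.5, 0.5])), u=0)
print(b_fixed, b_prior)
```

In this toy case the robot that commits to $\phi_{1}$ becomes more confident after one action than the robot that hedges across both strategies, which is only appropriate if the user really does teach with $\phi_{1}$.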

IV-B Inferring a Joint Belief

Now that we have introduced learning with multiple teaching strategies, we can solve learning with strategy uncertainty (Definition 1). Here we not only want to learn the target model $\theta^{*}$ , but we also recognize that the robot is uncertain about $\phi^{*}$ . Let us define the robot’s joint belief $b(\theta,\phi)$ over the target models $\theta\in\Theta$ and teaching strategies $\phi\in\Phi$ to be:

$$b^{t+1}(\theta, \phi) = P(\theta, \phi \mid u^{0:t}, x^{0:t}) \tag{8}$$

Again leveraging Bayes’ rule and conditional independence:

$$b^{t+1}(\theta, \phi) \propto b^{t}(\theta, \phi) \cdot P(u^{t} \mid x^{t}; \theta, \phi) \tag{9}$$

where $P(u \mid x;\theta,\phi)$ is the conditional probability of human action $u$ given $x$, $\theta$, and $\phi$. But this is the same as (1), so that:

$$b^{t+1}(\theta, \phi) \propto b^{t}(\theta, \phi) \cdot \pi(u^{t} \mid x^{t}; \theta, \phi) \tag{10}$$

Using (10), we learn about both the human’s target model and the human’s teaching strategy from $x^{0:t}$ and $u^{0:t}$ .
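A discrete version of the joint update (10) might look like the sketch below. The grid representation, the function name, and the toy policy table are illustrative assumptions; the only part taken from the text is the rule that the joint belief is multiplied by the policy likelihood and renormalized.

```python
import numpy as np

def update_joint_belief(b_joint, pi, u):
    """Eq. (10): b^{t+1}(theta, phi) is proportional to
    b^t(theta, phi) * pi(u^t | x^t; theta, phi).

    b_joint[i, k] = belief in (theta_i, phi_k).
    pi[i, k, j]   = pi(u_j | x; theta_i, phi_k) for one fixed state x (assumed layout).
    """
    posterior = b_joint * pi[:, :, u]
    return posterior / posterior.sum()

# Hypothetical toy policy: phi_1 is informative, phi_2 is uninformative.
pi = np.array([
    [[0.9, 0.1], [0.5, 0.5]],  # theta_1 under phi_1, phi_2
    [[0.1, 0.9], [0.5, 0.5]],  # theta_2 under phi_1, phi_2
])

b_joint = np.full((2, 2), 0.25)  # uniform prior over (theta, phi)
for u in [0, 0, 0]:              # the same action observed three times
    b_joint = update_joint_belief(b_joint, pi, u)

b_theta = b_joint.sum(axis=1)    # marginal belief over target models
b_phi = b_joint.sum(axis=0)      # marginal belief over teaching strategies
print(b_theta, b_phi)
```

Both marginals are read off the same table, so evidence about the strategy and evidence about the target model are accumulated together rather than in separate passes.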

Let us compare the observation model for this joint learning rule to the observation models from (6) and (7). If we rewrite (10) into the form of (5), we obtain the observation model:

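$$P(u^{t} \mid x^{t}; \theta) = \int_{\Phi} \pi(u^{t} \mid x^{t}; \theta, \phi) \cdot b^{t}(\phi \mid \theta)\, d\phi \tag{11}$$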

where the belief over teaching strategies given $\theta$ is:

$$b^{t}(\phi \mid \theta) = P(\phi \mid u^{0:t-1}, x^{0:t-1}; \theta) \tag{12}$$

Intuitively, a robot implementing (11) and (12) reasons across multiple teaching strategies when learning from the end-user, and also updates its belief over these teaching strategies every timestep. We find that the observation model (7) is a special case of (11) when the robot never updates its initial belief $b^{0}(\phi \mid \theta)=b^{0}(\phi)$, i.e., if the robot’s belief over teaching strategies is constant. Accordingly, (6) is a special case of (11) by extension. Our analysis shows that inferring a joint belief over $\theta$ and $\phi$ both generalizes prior work and is an optimal learning rule.
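The step from the joint update (10) to the observation model (11) is not spelled out above. One way to fill it in, offered here as a sketch rather than the authors' own derivation, is to marginalize the joint belief over $\phi$ and factor it with the chain rule:

```latex
\begin{align*}
b^{t+1}(\theta)
  &= \int_{\Phi} b^{t+1}(\theta, \phi)\, d\phi \\
  &\propto \int_{\Phi} b^{t}(\theta, \phi) \cdot \pi(u^{t} \mid x^{t}; \theta, \phi)\, d\phi \\
  &= b^{t}(\theta) \cdot \int_{\Phi} \pi(u^{t} \mid x^{t}; \theta, \phi) \cdot b^{t}(\phi \mid \theta)\, d\phi ,
\end{align*}
% using b^{t}(\theta, \phi) = b^{t}(\theta) \cdot b^{t}(\phi \mid \theta).
```

Comparing the last line with $b^{t+1}(\theta) \propto b^{t}(\theta) \cdot P(u^{t} \mid x^{t}; \theta)$ identifies the integral as the observation model. The normalizing constant in the second line does not depend on $\theta$ or $\phi$, so the proportionality is preserved by the marginalization.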

IV-C Learning Example

To demonstrate how the proposed observation models affect the robot’s learning, we here provide an example simulation. Consider the sorting task in Fig. 1, where the robot is attempting to learn the right threshold classifier from the human. At each timestep $t$ , the human action $u$ indicates one screw that should be classified as short; the robot then classifies the remaining screws without additional guidance. Let $\theta^{*}$ be the correct decision boundary, and let the robot’s reward equal the total number of screws classified correctly. We can think of this example as an instance of inverse reinforcement learning [4], where the robot learns the true objective $\theta^{*}$ .

Importantly, we include two different teaching strategies $\phi\in\Phi$ that the human might use. Within the first strategy, $\phi_{1}$ , the human noisily indicates the short screw closest to $\theta^{*}$ , so that ${\pi(\phi_{1})\propto\exp\{-\frac{1}{2}\cdot|\theta^{*}-u|\}}$ . Within the second strategy, $\phi_{2}$ , the user indicates a short screw uniformly at random, so that $\pi(\phi_{2})\propto 0.9$ if $u\leq\theta^{*}$ or $\pi(\phi_{2})\propto 0.1$ otherwise. Each end-user leverages one of these two teaching strategies; however, the robot does not know which.
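The two teaching strategies can be tabulated as follows. This is a sketch under stated assumptions: screws and candidate boundaries are both indexed by the integers 0 to 9, each policy is normalized over the ten possible actions, and the function name and the example action sequence are hypothetical. Only the two unnormalized forms, $\exp\{-\frac{1}{2}|\theta-u|\}$ and the 0.9 / 0.1 split, come from the text.

```python
import numpy as np

def build_teaching_policies(n_screws=10):
    """Return pi[theta, phi, u] for the two strategies in the sorting example.

    Assumption: screws and decision boundaries share the integer grid 0..n_screws-1.
    """
    u = np.arange(n_screws)
    pi = np.zeros((n_screws, 2, n_screws))
    for theta in range(n_screws):
        # phi_1: noisily indicate the screw closest to the boundary.
        closest = np.exp(-0.5 * np.abs(theta - u))
        # phi_2: indicate a short screw (u <= theta) roughly uniformly at random.
        random_short = np.where(u <= theta, 0.9, 0.1)
        pi[theta, 0] = closest / closest.sum()
        pi[theta, 1] = random_short / random_short.sum()
    return pi

pi = build_teaching_policies()

# Joint belief over (theta, phi), updated from a hypothetical action sequence.
b_joint = np.full((10, 2), 1.0 / 20.0)
for u_t in [2, 5, 0]:
    b_joint = b_joint * pi[:, :, u_t]
    b_joint = b_joint / b_joint.sum()

b_phi = b_joint.sum(axis=0)  # how strongly the actions point to each strategy
print(b_phi)
```

Widely scattered indications are hard to explain under $\phi_{1}$ for any single boundary, so a sequence like this one is the kind of evidence that lets the robot tell the two kinds of teacher apart.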

Observation Models. We compare (6), (7), and (11). Let $\phi_{1}$ denote a robot that learns with (6), and assumes $\phi_{1}=\phi^{*}$ for all users. Similarly, $\phi_{2}$ is a robot that assumes $\phi_{2}=\phi^{*}$ . Prior denotes a robot with observation model (7), and Joint leverages our proposed approach (11). Finally, $\phi^{*}$ is an ideal robot that knows the teaching strategy for each individual user.

Simulation. At timestep $t$ the robot observes the action $u^{t}$ and updates its belief $b^{t+1}(\theta)$ with (5). Next, the robot optimally sorts $10$ screws based on its current belief [8]. At timestep $t+1$ the task is repeated with the same end-user (who has a constant $\theta^{*}$ and $\phi^{*}$ ). The results of these simulations averaged across $10^{5}$ end-users are shown in Figs. 2 and 3.

Analysis. Using our proposed Joint observation model resulted in fewer errors than learning under $\phi_{1}$, $\phi_{2}$, or Prior. With Joint the robot was able to personalize its learning strategy to the current end-user across multiple iterations, and more accurately learn what the user was communicating (see Fig. 2). As expected, Prior outperformed other fixed strategies when $b^{0}$ was correct; however, if the robot did not have an accurate prior over teaching strategies, the Prior observation model (7) performed worse than $\phi_{1}$ (see Fig. 3, right). We found that our proposed approach was robust to this practical challenge: despite having the wrong prior, Joint still caused the robot’s behavior to converge to that of the ideal learner, $\phi^{*}$.

V Robot Teaching with Strategy Uncertainty

Within this section we consider the opposite problem, where the expert robot is teaching the human about $\theta^{*}$, but does not know the end-user’s learning strategy $\psi^{*}$. Teaching here is challenging because the robot is not certain what the user will learn from its actions. We first outline a specific instance of robot teaching, where the human learns through Bayesian inference. Next, we demonstrate how the robot can teach with multiple models of the human’s learning strategy, and derive one solution to teaching with strategy uncertainty. In a simulated example, we compare these methods to robots that teach with a constant point estimate of $\psi^{*}$. We also describe how the robot can trade off between teaching $\theta^{*}$ to and learning $\psi^{*}$ from the human via active teaching.

V-A Teaching Bayesian Humans

Similar to previous works in machine teaching [14, 17] and cognitive science [18, 19], we assume that the human learns by performing Bayesian updates. Thus, the human’s belief over the target models after robot action $a^{t}$ becomes:

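$$b^{t+1}(\theta) = \frac{b^{t}(\theta) \cdot \pi(a^{t} \mid x^{t}; \theta; \psi^{*})}{Z(\psi^{*})} \tag{13}$$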

where we use ; to denote that the human observes $\psi^{*}$ but not $\theta^{*}$ . The denominator is again the normalizing constant:

$$Z(\psi) = \int_{\Theta} b^{t}(\xi) \cdot \pi(a^{t} \mid x^{t}; \xi; \psi)\, d\xi \tag{14}$$

We point out that $\pi$ in (13) and (14) is (2), the policy that the human assigns to the robot. The human interprets the robot’s actions, and updates their belief, based on this policy, which is parameterized by the human’s true learning strategy $\psi^{*}$. Here $b^{t}$ is also the state of the human at timestep $t$, and (13) defines the state dynamics (i.e., the human’s transition function).

The robot should select actions so that this state transitions to $b(\theta^{*})=1$ . Let us define the ideal robot action as:

$$a^{t} = \arg\max_{a} b^{t+1}(\theta^{*}) \tag{15}$$

where $a^{t}$ will greedily maximize the human’s belief in $\theta^{*}$ at the subsequent timestep. The human takes an action $u^{t}$ based on what they have previously learned; the human actions are therefore observations on the human’s state, i.e., ${u^{t}=h(b^{t})}$ . For example, the human’s action could be completing a test about the target models, or performing the task themselves. Here we consider the simplest case, where:

$$u^{t} = h(b^{t}) = b^{t} \tag{16}$$

Hence, the human feedback $u^{t}$ provides their actual belief over the target models at the current timestep. The robot observes the human state $b^{t}$ from (16), and selects action $a^{t}$ with (15) to shift the human towards the desired state $b^{t+1}$ .
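The human update (13)-(14) and the greedy teaching action (15) can be sketched for a discrete set of target models and robot actions. The function names and the toy policy table are assumptions for illustration; the sketch also assumes the normalizer $Z$ is nonzero for every action, and that the robot knows $\psi^{*}$ exactly.

```python
import numpy as np

def human_update(b_theta, pi_psi, a):
    """Eqs. (13)-(14): the human's Bayesian update after robot action a.

    pi_psi[i, j] = pi(a_j | x; theta_i; psi*), the policy the human assigns
    to the robot, for one fixed state x (assumed layout).
    """
    unnormalized = b_theta * pi_psi[:, a]
    return unnormalized / unnormalized.sum()  # assumes Z(psi*) > 0

def greedy_teaching_action(b_theta, pi_psi, theta_star):
    """Eq. (15): pick the action that maximizes the human's next belief in theta*."""
    n_actions = pi_psi.shape[1]
    next_belief = [human_update(b_theta, pi_psi, a)[theta_star] for a in range(n_actions)]
    return int(np.argmax(next_belief))

# Hypothetical toy numbers: 2 target models, 3 robot actions.
pi_psi = np.array([
    [0.6, 0.3, 0.1],  # theta_1
    [0.2, 0.3, 0.5],  # theta_2
])
b_human = np.array([0.5, 0.5])  # u^t = b^t, the belief reported by the human (16)

a_best = greedy_teaching_action(b_human, pi_psi, theta_star=0)
b_next = human_update(b_human, pi_psi, a_best)
print(a_best, b_next)
```

The chosen action is the one with the largest likelihood ratio in favor of $\theta^{*}$, not simply the action that is most probable under $\theta^{*}$; in this table the two happen to coincide.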

V-B Multiple Learning Strategies

Consider cases where the robot is teaching this Bayesian human, but does not know the human’s learning strategy $\psi^{*}$ . When $\psi^{*}$ (and therefore the future state $b^{t+1}$ ) is unknown, teaching is analogous to controlling an agent with unknown state dynamics [20]. Define $\hat{b}$ as the robot’s prediction of the human’s state given the history of actions and world states:

$$\hat{b}^{t+1}(\theta) = P(\theta \mid u^{0:t}, a^{0:t}, x^{0:t}) \tag{17}$$

Since the human performs Bayesian inference (13), and recalling that the robot observes $b^{t}(\theta)$ , we equivalently have:

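$$\hat{b}^{t+1}(\theta) = \int_{\Psi} \frac{u^{t}(\theta) \cdot \pi(a^{t} \mid x^{t}; \theta, \psi)}{Z(\psi)} \cdot b^{t}(\psi)\, d\psi \tag{18}$$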

Within the above, $b(\psi)$ is the robot’s belief over the human’s learning strategies. For the state-of-the-art, the robot estimates the human’s learning strategy as $\psi^{0}$ , so that (18) reduces to:

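$$\hat{b}^{t+1}(\theta) = \frac{u^{t}(\theta) \cdot \pi(a^{t} \mid x^{t}; \theta, \psi^{0})}{Z(\psi^{0})} \tag{19}$$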

Instead, we here argue that the robot should teach with a belief over multiple learning strategies. Let $b^{0}(\psi)$ be the prior over what $\psi^{*}$ is. If the robot never updates this initial belief, then the predicted human state after action $a^{t}$ becomes:

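$$\hat{b}^{t+1}(\theta) = \int_{\Psi} \frac{u^{t}(\theta) \cdot \pi(a^{t} \mid x^{t}; \theta, \psi)}{Z(\psi)} \cdot b^{0}(\psi)\, d\psi \tag{20}$$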

Comparing (20) to (19), now the robot reasons about how its actions are interpreted by each learning strategy. When selecting the action $a^{t}$ with (15)—where we replace $b^{t+1}$ with prediction $\hat{b}^{t+1}$ —this robot teaches across multiple strategies.
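A discrete sketch of the prediction step may help make the contrast between (19) and (20) concrete. The array layout, function name, and toy numbers are assumptions for illustration; the integral over $\Psi$ becomes a weighted sum over a finite set of candidate learning strategies.

```python
import numpy as np

def predict_human_state(u_t, pi, b_psi, a):
    """Eqs. (18)-(20): predicted human belief after robot action a.

    u_t[i]      = the human's current belief in theta_i (observed, since u^t = b^t).
    pi[k, i, j] = pi(a_j | x; theta_i, psi_k) for one fixed state x (assumed layout).
    b_psi[k]    = the robot's belief in learning strategy psi_k.
    """
    per_strategy = u_t[None, :] * pi[:, :, a]                # numerator for each psi
    per_strategy /= per_strategy.sum(axis=1, keepdims=True)  # divide by Z(psi)
    return b_psi @ per_strategy                              # average over strategies

# Hypothetical toy numbers: 2 learning strategies, 2 target models, 2 actions.
pi = np.array([
    [[0.8, 0.2], [0.2, 0.8]],  # psi_1: the human reads a lot into the action
    [[0.5, 0.5], [0.5, 0.5]],  # psi_2: the human learns nothing from it
])
u_t = np.array([0.5, 0.5])

# Eq. (19): point estimate psi^0 = psi_1.
b_hat_point = predict_human_state(u_t, pi, np.array([1.0, 0.0]), a=0)
# Eq. (20): fixed, even prior over both strategies.
b_hat_prior = predict_human_state(u_t, pi, np.array([0.5, 0.5]), a=0)
print(b_hat_point, b_hat_prior)
```

Note that each strategy's prediction is normalized by its own $Z(\psi)$ before averaging, which is what distinguishes (20) from simply averaging the policies first.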

V-C Inferring the Learning Strategy

Because the robot is getting feedback from the user, however, we can also infer that specific user’s learning strategy, $\psi^{*}$ . Learning about $\psi^{*}$ provides a solution to teaching with strategy uncertainty (Definition 2), and results in robots that adapt their teaching to match the human. Let us formally define the robot’s belief over learning strategies as:

$$b^{t}(\psi) = P(\psi \mid u^{0:t}, a^{0:t}, x^{0:t}) \tag{21}$$

We use the superscript $t$ instead of ${t+1}$ since $b^{t}(\psi)$ does not actually depend on $a^{t}$, as we will show. Applying Bayes’ rule:

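$$b^{t}(\psi) \propto P(u^{0:t} \mid a^{0:t}, x^{0:t}; \psi) \cdot P(\psi \mid a^{0:t}, x^{0:t}) \tag{22}$$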

Recalling that the human’s learning strategy is not altered by the robot, $P(\psi \mid a^{0:t},x^{0:t})=P(\psi)$. Moreover, because the human is a Bayesian learner, and $u^{t}=b^{t}$, here $u^{t}$ depends on $u^{t-1}$, $a^{t-1}$, $x^{t-1}$, and $\psi$ (13). Hence, (22) simplifies to:

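$$b^{t}(\psi) \propto b^{t-1}(\psi) \cdot P\bigg[u^{t} \,\bigg|\, \frac{u^{t-1} \cdot \pi(a^{t-1} \mid x^{t-1}; \theta, \psi)}{Z(\psi)}\bigg] \tag{23}$$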

Intuitively, (23) claims that the belief over learning strategies is updated based on the differences between the human’s actual state (left side of the likelihood function) and the predicted human state given $\psi$ (right side of the likelihood function). (We used Kullback-Leibler (KL) divergence [21] to define the likelihood of $u^{t}$ given the right side of (23), but other options are possible.) By observing $u$, we can use (23) to infer the human’s learning strategy. By then substituting (23) back into (18), the robot learns about $\psi^{*}$ while teaching the human $\theta^{*}$: thus, using (18) with (23) addresses teaching with strategy uncertainty.
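The update (23) can be sketched once a concrete likelihood is chosen. The text only says that KL divergence was used to define it; the exponential form $\exp\{-\beta \cdot \mathrm{KL}\}$, the parameter `beta`, the function names, and the toy numbers below are all assumptions made for illustration.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions, clipped to avoid log(0)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def update_strategy_belief(b_psi, u_prev, u_now, pi, a_prev, beta=10.0):
    """Eq. (23) with an assumed likelihood exp(-beta * KL(actual || predicted)).

    pi[k, i, j] = pi(a_j | x; theta_i, psi_k) for one fixed state x (assumed layout).
    """
    predicted = u_prev[None, :] * pi[:, :, a_prev]
    predicted /= predicted.sum(axis=1, keepdims=True)  # predicted human state per psi
    likelihood = np.array([np.exp(-beta * kl_divergence(u_now, q)) for q in predicted])
    posterior = b_psi * likelihood
    return posterior / posterior.sum()

# Hypothetical toy numbers: 2 learning strategies, 2 target models, 2 actions.
pi = np.array([
    [[0.8, 0.2], [0.2, 0.8]],  # psi_1
    [[0.5, 0.5], [0.5, 0.5]],  # psi_2
])
u_prev = np.array([0.5, 0.5])
u_now = np.array([0.8, 0.2])   # the human's reported belief after robot action 0

b_psi = update_strategy_belief(np.array([0.5, 0.5]), u_prev, u_now, pi, a_prev=0)
print(b_psi)
```

Here the reported belief matches what a $\psi_{1}$ learner would hold, so the robot's belief shifts toward $\psi_{1}$. A larger `beta` makes that shift sharper, at the cost of being less forgiving of noisy feedback.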

V-D Teaching Example

Here we provide an illustration of how reasoning over multiple learning models can improve teaching with uncertainty. As shown in Fig. 4, the robot is moving towards goal position $\theta^{*}$ , and wants to teach that goal to the nearby human. At each timestep $t$ , the robot’s action $a$ is an incomplete trajectory (e.g., see the three trajectory segments in Fig. 4). After observing this robot trajectory, the human updates their belief over $\theta^{*}\,$ ; specifically, the human applies Bayesian inference to determine whether the robot’s goal is the cup or the plate. The robot uses (15) with prediction $\hat{b}^{t+1}$ to select the trajectory $a$ which will teach the human the most about $\theta^{*}$ .

We consider two possible learning strategies $\psi\in\Psi$ for the simulated end-users. Humans with $\psi_{1}$ learn best from legible (i.e., exaggerated) trajectories [22]: $\pi(\psi_{1})=(0.1,0.3,0.45)$ if the robot moves directly towards $\theta$, slightly exaggerates, or fully exaggerates, respectively. By contrast, under $\psi_{2}$ the user learns best from predictable (i.e., goal-directed) trajectories, such that $\pi(\psi_{2})=(0.35,0.2,0.15)$ if the robot moves directly towards $\theta$, slightly exaggerates, or fully exaggerates, respectively. The robot does not know which strategy a given user selects.

Prediction Method. We compare (18), (19), and (20). Let $\psi_{1}$ denote a robot which predicts that every user learns with (19), where $\psi^{0}=\psi_{1}$ . Likewise, $\psi_{2}$ is a robot that estimates $\psi^{0}=\psi_{2}$ . The Prior robot reasons over both learning strategies using (20), and our proposed Learn robot solves teaching with strategy uncertainty by leveraging (18) with (23).

Simulation. The robot observes the human action $u$ —i.e., the human’s current belief—and selects an action $a$ using (15) and its prediction method. The robot can select between $6$ different legible or goal-directed trajectory segments ( $3$ for each goal $\theta$ ). The human is a Bayesian learner. Our results (averaged across $10^{5}$ simulated users) are depicted in Figs. 5 and 6.

Analysis. Robots using our proposed Learn approach taught $\theta^{*}$ more quickly than robots using the fixed teaching methods $\psi_{1}$ , $\psi_{2}$ , or Prior. Reasoning over human learning strategies led to better teaching during a single interaction (see Fig. 5). For multiple iterations, we tested practical scenarios where the robot has the wrong prior: in every case, Learn yielded the fastest convergence, and taught as well as the ideal teacher after $\approx 5$ timesteps (see Fig. 6). Intuitively, the Learn robot gradually shifted to teaching with either legible or predictable trajectories, while the Prior robot continued to compromise between both strategies instead of adapting to the specific user.

V-E Active Teaching

As we saw in the previous example, learning about the human’s learning strategy $\psi^{*}$ can improve the robot’s teaching. Hence, we here focus on selecting robot actions which actively gather information about $\psi^{*}$ , so that the robot more quickly adapts its teaching to the end-user. Let us formulate teaching with strategy uncertainty as a partially observable Markov decision process (POMDP) [16]: the state is $\big(b^{t}(\theta),\theta^{*},\psi^{*}\big)$ , the action is $(a^{t},x^{t})$ , the observation is $u^{t}$ , the state transitions with (13) (where $\theta^{*}$ and $\psi^{*}$ are constant), the observation model is (23), and the reward is $b^{t}(\theta^{*})$ . Solving this POMDP causes the robot to optimally trade off between exploring for more information about $\psi^{*}$ and exploiting that information to maximize the human’s belief in $\theta^{*}$ . When solving this POMDP is intractable, we can more simply perform active teaching by favoring actions that gather information about $\psi^{*}$ [23]:

[TABLE]

In the above, $\lambda\geq 0$ , and $H$ is the Shannon entropy. Comparing (24) to (15), now the robot selects actions to disambiguate between the possible learning strategies (i.e., reduce the entropy of the robot’s belief over $\psi$ ). Intuitively, we expect a robot that is actively teaching with (24) to select actions, $a$ , which cause users with different learning strategies to respond in different ways, allowing that robot to more easily infer $\psi^{*}$ .
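The trade-off described above can be illustrated with a minimal sketch. It is not the paper's implementation of (24): it assumes the human's response is discretized into a few outcomes, that the robot has a hypothetical table `response_models[a][psi][r]` giving the probability of response `r` to action `a` under strategy `psi`, and that the teaching term of (15) is summarized by a hypothetical per-action score `teach_value[a]`. All names and numbers are illustrative.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    nonzero = p[p > 0]
    return float(-np.sum(nonzero * np.log(nonzero)))

def expected_posterior_entropy(belief_psi, response_model):
    """Expected entropy of the belief over psi after observing the human's response.

    belief_psi:     shape (n_psi,), current belief over learning strategies
    response_model: shape (n_psi, n_responses), P(response | psi) for ONE robot action
    """
    belief_psi = np.asarray(belief_psi, dtype=float)
    response_model = np.asarray(response_model, dtype=float)
    p_response = belief_psi @ response_model          # marginal P(response)
    total = 0.0
    for r, p_r in enumerate(p_response):
        if p_r <= 0:
            continue
        posterior = belief_psi * response_model[:, r] / p_r   # Bayes' rule over psi
        total += p_r * entropy(posterior)
    return total

def select_action(teach_value, belief_psi, response_models, lam):
    """Pick the action maximizing teaching value minus lam * expected entropy over psi."""
    scores = [
        teach_value[a] - lam * expected_posterior_entropy(belief_psi, response_models[a])
        for a in range(len(teach_value))
    ]
    return int(np.argmax(scores)), scores

# Hand-picked illustrative numbers (assumptions, not values from the paper).
belief_psi = [0.5, 0.5]
response_models = [
    [[0.5, 0.5], [0.5, 0.5]],   # action 0: both strategies respond identically
    [[0.9, 0.1], [0.1, 0.9]],   # action 1: the two strategies respond differently
]
teach_value = [0.60, 0.55]      # action 0 teaches slightly better in the short term
best_passive, _ = select_action(teach_value, belief_psi, response_models, lam=0.0)
best_active, _ = select_action(teach_value, belief_psi, response_models, lam=1.0)
```

With these numbers, $\lambda=0$ selects action 0 (the better short-term teacher), while $\lambda=1$ selects action 1, whose responses separate the two strategies and so reduce the expected entropy over $\psi$.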

VI Robot Learning Simulations

To compare our learning with strategy uncertainty against the state-of-the-art in a realistic problem setting, we performed a simulated user study. We here consider an instance of inverse reinforcement learning (IRL): the human demonstrates a policy, and the robot attempts to infer the human’s reward function from that demonstrated policy [3, 4, 5]. Unlike the example in Section IV-C, now $\theta^{*}$ (the human’s reward parameters) and $\phi^{*}$ (the human’s demonstration strategy) lie in continuous spaces. We compared robots that learn $\theta^{*}$ with a constant point estimate of $\phi^{*}$ to our proposed method, where the robot learns about both $\theta^{*}$ and $\phi^{*}$ from the human. To test the robustness of our method within more complex and challenging scenarios, we also introduced noisy end-users, who did not follow any of the modeled teaching strategies.

VI-A Setup and Simulated Users

Within each simulation the human and robot were given an 8-by-8 gridworld (64 states). The state reward, $R(x,\theta)$ , is the linear combination of state features $f(x)$ weighted by $\theta$ , so that $R(x,\theta)=\theta\cdot f(x)$ . The human knows $\theta^{*}$ , and provides a demonstration $\pi(u\mid x,\theta^{*},\phi^{*})$ . This demonstration is a policy, where the human labels each state $x$ with action $u$ ; actions deterministically move in one of the four cardinal directions. The discount factor, which defines the relative importance of future and current rewards, was fixed at $\gamma=0.9$ .
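The setup above can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes that moves into a wall leave the state unchanged, that the action value is $Q(x,u,\theta)=R(x,\theta)+\gamma\max_{u'}Q(x',u',\theta)$, and that $\theta$ is drawn uniformly from $[-1,1]$. The function names are hypothetical.

```python
import numpy as np

N = 8            # gridworld side length (8-by-8, 64 states)
GAMMA = 0.9      # discount factor from the text
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def next_state(x, u, n=N):
    """Deterministic transition; moving into a wall keeps the state (assumption)."""
    row, col = divmod(x, n)
    d_row, d_col = MOVES[u]
    row = min(max(row + d_row, 0), n - 1)
    col = min(max(col + d_col, 0), n - 1)
    return row * n + col

def solve_q(theta, features, gamma=GAMMA, iters=500, n=N):
    """Value iteration for Q(x, u, theta) with state reward R(x, theta) = theta . f(x)."""
    reward = features @ theta
    nxt = np.array([[next_state(x, u, n) for u in range(4)] for x in range(n * n)])
    V = np.zeros(n * n)
    for _ in range(iters):
        Q = reward[:, None] + gamma * V[nxt]   # one-step lookahead
        V = Q.max(axis=1)                      # act optimally afterwards
    return Q, reward, nxt

# Example with |F| = 4 random binary features per state (sampling range for theta assumed).
rng = np.random.default_rng(0)
features = rng.integers(0, 2, size=(N * N, 4)).astype(float)
theta = rng.uniform(-1, 1, size=4)
Q, reward, nxt = solve_q(theta, features)
```

The returned `Q` has one row per state and one column per cardinal action, which is the quantity the demonstration model below the setup description relies on.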

Our setting is based upon previous IRL works [5], where this problem is more formally introduced as a Markov decision process (MDP). These prior works typically assume that the human’s demonstrated policy approximately solves the MDP, i.e., maximizes the expected sum of discounted rewards [8, 9]. By contrast, we here considered users with a spectrum of demonstration strategies. Let $Q(x,u,\theta)$ be the reward for taking action $u$ in state $x$ , and then following the optimal policy for reward parameters $\theta$ . We define the probability that the simulated user takes action $u$ given $x$ , $\theta^{*}$ , and $\phi^{*}$ as:

[TABLE]

where $\phi^{*}\in[-1,1]$ , and $x^{\prime}$ is the state reached after taking action $u$ in state $x$ . When $\phi^{*}=0$ , (25) is the same as the observation model from [8, 9]. As $\phi^{*}\rightarrow+1$ , the human biases their demonstration towards states that have locally higher rewards; conversely, when $\phi^{*}\rightarrow-1$ , the human favors states with lower rewards. Sample user demonstrations with different teaching strategies are shown in Fig. 7.
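To illustrate the qualitative behavior described here, the sketch below uses one form that is consistent with the text: $\pi(u\mid x,\theta,\phi)\propto\exp\big(\alpha\,[Q(x,u,\theta)+\phi\,R(x^{\prime},\theta)]\big)$. This expression is an assumption made for illustration and may differ from the exact form of (25); it reduces to the usual Boltzmann model $\exp(\alpha Q)$ when $\phi=0$. The function name and the toy numbers are hypothetical.

```python
import numpy as np

def demonstration_policy(Q, reward, nxt, alpha, phi):
    """Assumed form: pi(u | x) proportional to exp(alpha * (Q[x, u] + phi * R(x'))).

    Q:      shape (n_states, n_actions), action values
    reward: shape (n_states,), state rewards R(x, theta)
    nxt:    shape (n_states, n_actions), index of the state x' reached by each action
    """
    Q = np.asarray(Q, dtype=float)
    reward = np.asarray(reward, dtype=float)
    score = alpha * (Q + phi * reward[nxt])
    score -= score.max(axis=1, keepdims=True)    # subtract row max for numerical stability
    p = np.exp(score)
    return p / p.sum(axis=1, keepdims=True)

# Toy 3-state example with hand-picked values (not computed from an MDP):
# in state 0, both actions have equal Q, but action 0 leads to a higher-reward state.
Q_toy = np.array([[5.0, 5.0], [1.0, 1.0], [0.0, 0.0]])
reward_toy = np.array([0.0, 1.0, 0.0])
nxt_toy = np.array([[1, 2], [1, 1], [2, 2]])

p_zero = demonstration_policy(Q_toy, reward_toy, nxt_toy, alpha=1.0, phi=0.0)
p_high = demonstration_policy(Q_toy, reward_toy, nxt_toy, alpha=1.0, phi=+1.0)
p_low = demonstration_policy(Q_toy, reward_toy, nxt_toy, alpha=1.0, phi=-1.0)
```

In state 0 the unbiased teacher ($\phi=0$) is indifferent between the two actions, the teacher with $\phi=+1$ prefers the action leading to the locally higher reward, and the teacher with $\phi=-1$ prefers the other action.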

VI-B Independent Variables

We compared four different approaches for learning $\theta^{*}$ from the user’s demonstration: $\phi^{*}$ , $\phi=-1$ , $\phi=+1$ , and Joint. Under $\phi^{*}$ the ideal robot knows the human’s true teaching strategy, while $\phi=-1$ and $\phi=+1$ indicate robots which assume that the human’s demonstration is biased towards low-reward or high-reward states, respectively. Joint refers to a robot which attempts to learn both $\phi^{*}$ and $\theta^{*}$ from the human’s demonstration, as discussed in Section IV-B.
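As a rough sketch of what the Joint robot computes (Section IV-B gives the paper's own formulation), suppose the priors $P(\theta)$ and $P(\phi)$ are independent and the demonstrated actions are conditionally independent given the state, $\theta$ , and $\phi$ . Both are assumptions made here for illustration. Bayes' rule then gives a joint posterior, from which the robot can take marginal point estimates:

$$P(\theta,\phi \mid u^{0:t},x^{0:t}) \;\propto\; P(\theta)\,P(\phi)\prod_{\tau=0}^{t}\pi(u^{\tau}\mid x^{\tau},\theta,\phi), \qquad P(\theta \mid u^{0:t},x^{0:t}) = \int P(\theta,\phi \mid u^{0:t},x^{0:t})\,d\phi$$

Under this view, the $\phi=-1$ and $\phi=+1$ robots correspond to replacing $P(\phi)$ with a point mass at the assumed strategy, and the ideal $\phi^{*}$ robot to a point mass at the true strategy.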

To see how these approaches scale with the length of the feature vector, $f\in F$ , we performed simulations with $|F|=4$ , $8$ , and $16$ features. In practice, each state $x$ was randomly assigned a feature vector with $|F|$ binary values, indicating which features were present in that particular gridworld state.

Finally, to test how well the robot learned when the human demonstrations were imperfect, we varied the value of $\alpha$ in (25). Parameter $\alpha$ represents how close to optimal the human is: as $\alpha\rightarrow 0$ , the human becomes increasingly random, while the human always chooses the best action when $\alpha\rightarrow\infty$ .
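To make these two limits concrete, suppose (as an assumption, since only the Boltzmann-style dependence on $\alpha$ matters here) that (25) can be written as $\pi(u\mid x)\propto\exp\big(\alpha\,s(x,u)\big)$ for some score $s$ over the four actions. Then:

$$\lim_{\alpha\to 0}\pi(u\mid x)=\frac{e^{0}}{\sum_{u'}e^{0}}=\frac{1}{4},\qquad \lim_{\alpha\to\infty}\pi(u\mid x)=\begin{cases}1 & \text{if } u=\arg\max_{u'}s(x,u')\\ 0 & \text{otherwise}\end{cases}$$

where the second limit assumes that the maximizing action is unique.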

We simulated $100$ users for each combination of $|F|$ and $\alpha$ , where the users’ teaching strategies were uniformly distributed in the continuous interval $\phi^{*}\in[-1,1]$ . The gridworld and $\theta^{*}$ were randomly generated for each individual user.

VI-C Dependent Measures

For each simulation we measured the robot’s learning performance in terms of Reward Error, Strategy Error, and Policy Loss. Reward Error is the difference between the robot’s mean estimate of $\theta^{*}$ and the correct reward parameters: $\|\theta^{*}-\hat{\theta}\|_{1}$ . Similarly, Strategy Error is the error between the robot’s mean estimate of $\phi^{*}$ and the user’s actual teaching strategy: $|\phi^{*}-\hat{\phi}|$ . Policy Loss measures how much reward is lost by following the robot’s learned policy (which maximizes reward under $\hat{\theta}\,$ ) as compared to the optimal policy for $\theta^{*}$ [8]. The code for our examples and simulations can be found at https://github.com/dylanplosey/iact_strategy_learning.
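The three measures can be sketched as below. This is an illustrative reading of the definitions above, not the code from the linked repository: Policy Loss is taken here as the state-averaged difference in value under the true reward (an assumption; other norms are possible), transitions are deterministic and given by a hypothetical table `nxt`, and all function names are illustrative.

```python
import numpy as np

def reward_error(theta_true, theta_hat):
    """L1 distance between the true and estimated reward parameters."""
    return float(np.sum(np.abs(np.asarray(theta_true, float) - np.asarray(theta_hat, float))))

def strategy_error(phi_true, phi_hat):
    """Absolute error between the true and estimated teaching strategy."""
    return abs(phi_true - phi_hat)

def greedy_policy(reward, nxt, gamma=0.9, iters=500):
    """Optimal policy via value iteration with Q(x, u) = R(x) + gamma * V(x')."""
    V = np.zeros(len(reward))
    for _ in range(iters):
        Q = reward[:, None] + gamma * V[nxt]
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def evaluate_policy(policy, reward, nxt, gamma=0.9):
    """Exact value of a deterministic policy: solve (I - gamma * P) V = R."""
    n = len(reward)
    P = np.zeros((n, n))
    P[np.arange(n), nxt[np.arange(n), policy]] = 1.0
    return np.linalg.solve(np.eye(n) - gamma * P, reward)

def policy_loss(theta_true, theta_hat, features, nxt, gamma=0.9):
    """Value lost under the TRUE reward by following the policy optimal for theta_hat."""
    r_true = features @ np.asarray(theta_true, float)
    r_hat = features @ np.asarray(theta_hat, float)
    v_star = evaluate_policy(greedy_policy(r_true, nxt, gamma), r_true, nxt, gamma)
    v_hat = evaluate_policy(greedy_policy(r_hat, nxt, gamma), r_true, nxt, gamma)
    return float(np.mean(v_star - v_hat))   # averaged over states (assumption)

# Toy 3-state chain (actions: left, right) with one indicator feature per state.
nxt_chain = np.array([[0, 1], [0, 2], [1, 2]])
features_chain = np.eye(3)
theta_true = [0.0, 0.0, 1.0]   # reward only in the right-most state
theta_hat = [1.0, 0.0, 0.0]    # a badly wrong estimate: reward in the left-most state

err = reward_error(theta_true, theta_hat)
loss = policy_loss(theta_true, theta_hat, features_chain, nxt_chain)
```

In this toy chain the wrong estimate sends the robot left instead of right, so the loss is strictly positive, whereas a perfect estimate gives zero Policy Loss.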

VI-D Results and Discussion

We performed a mixed ANOVA with the number of features and value of $\alpha$ as between-subjects factors, and the learning approach as a within-subjects factor, for both Policy Loss and Reward Error (see Figs. 8 and 9). Since we found a statistically significant interaction for both dependent measures ( $p<.05$ ), we next determined the simple main effects.

Simple main effects analysis showed that Joint resulted in significantly less Policy Loss than either $\phi=-1$ or $\phi=+1$ for each different combination of $|F|$ and $\alpha$ ( $p<.05$ ). We similarly found that Joint resulted in significantly less Reward Error ( $p<.001$ ) for every case except $|F|=16$ , $\alpha=5$ ; here there was no statistically significant difference between Joint and $\phi=+1$ ( ${p=.498}$ ). These results from Figs. 8 and 9 suggest that learning while maintaining a distribution over $\phi$ results in objectively better performance than learning with a fixed point estimate of $\phi^{*}$ .

Next, we investigated how well the Joint method learned the individual users’ teaching strategies. We performed a two-way ANOVA to find the effects of $|F|$ and $\alpha$ on the Joint robot’s Strategy Error (see Fig. 10). We found that the number of features ( $F(2,891)=23.813,p<.001$ ) and the human’s $\alpha$ ( $F(2,891)=22.679,p<.001$ ) had a significant main effect. Post-hoc analysis with Tukey HSD revealed that $|F|=16$ and $\alpha=20$ led to significantly higher Strategy Error than the other values of $|F|$ and $\alpha$ , respectively. As shown in Fig. 10, the robot had larger Strategy Error for higher values of $\alpha$ because it was unable to distinguish between teachers with $\phi^{*}>0$ ; i.e., these different teachers provided similar policy demonstrations when $\alpha=20$ .
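The reported degrees of freedom are consistent with a full-factorial design. Assuming three levels of $|F|$ , three levels of $\alpha$ (the values 5, 10, and 20 appear in the text), 100 simulated users per combination, and a two-way model that includes the interaction term:

$$N = 3\times 3\times 100 = 900,\qquad df_{|F|}=df_{\alpha}=3-1=2,\qquad df_{\text{error}} = N - 3\times 3 = 891$$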

Finally, we conducted a follow-up simulation in which we introduced unmodeled noise (see Fig. 11). Here $|F|=8$ and $\alpha=10$ , but we now increased the fraction of human actions that were completely random, which was not modeled in (25). Joint resulted in significantly less Policy Loss than $\phi=-1$ or $\phi=+1$ , even as the ratio of unmodeled user noise increased. Hence, reasoning over multiple strategies still improved performance for cases where the noisy end-user did not comply with any of the modeled teaching strategies.
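One simple way to inject this kind of unmodeled noise is sketched below. It assumes that each demonstrated action is independently replaced, with probability equal to the noise ratio, by an action drawn uniformly from the four cardinal directions; the paper does not spell out the exact procedure, so this is an illustrative guess with a hypothetical function name.

```python
import numpy as np

def add_unmodeled_noise(demo, noise_ratio, n_actions=4, rng=None):
    """Replace a random subset of demonstrated actions with uniformly random ones.

    demo:        integer array with one demonstrated action per state
    noise_ratio: probability that any given action is replaced (assumed i.i.d.)
    Returns the noisy demonstration and a boolean mask of the replaced entries.
    """
    rng = np.random.default_rng() if rng is None else rng
    demo = np.asarray(demo)
    is_random = rng.random(demo.shape) < noise_ratio
    random_actions = rng.integers(0, n_actions, size=demo.shape)
    return np.where(is_random, random_actions, demo), is_random

# Example: a 64-state demonstration that always moves "up" (action 0), with 25% noise.
clean = np.zeros(64, dtype=int)
noisy, is_random = add_unmodeled_noise(clean, 0.25, rng=np.random.default_rng(0))
```

Note that a replaced action can coincide with the original by chance, so the fraction of visibly changed labels is lower than the noise ratio.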

VI-E Challenges and Limitations

Although this simulated user study supports learning with strategy uncertainty, there are still practical challenges that may limit our proposed approach. In particular, the end-user’s actual interactions may not match any of the learning or teaching models, such that $\phi^{*}\notin\Phi$ or $\psi^{*}\notin\Psi$ . Having the wrong hypothesis space is often unavoidable when using models to learn from humans; however, as we show in Fig. 11, our proposed approach does remain robust to some errors in the hypothesis space (such as noisy users who do not follow any of the included models). In practice, designers could leverage data from previous trials to construct a richer space of possible interaction strategies, so that $\Phi$ is updated to include $\phi^{*}$ .

VII Conclusion

Because the human’s interaction strategy during HRI varies from end-user to end-user, robots that assume a fixed, pre-defined interaction strategy may result in inefficient, confusing interactions. Thus, we proposed that the robot should maintain a distribution over the human interaction strategies, and exchange information while reasoning over this distribution. We here introduced robot (a) learning with strategy uncertainty and (b) teaching with strategy uncertainty, and derived solutions to both novel problems. We performed learning and teaching examples—as well as learning simulations—and compared our approach to the state-of-the-art. Unlike standard approaches that assume every user interacts in the same way, we found that attempting to infer each individual end-user’s interaction strategy led to improved robot learning and teaching, while remaining robust to unmodeled strategies.

References

[1] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, “A survey of robot learning from demonstration,” Robotics and Autonomous Systems, vol. 57, no. 5, pp. 469–483, 2009.
[2] X. Zhu, “Machine teaching: An inverse problem to machine learning and an approach toward optimal education,” in AAAI, 2015, pp. 4083–4087.
[3] A. Y. Ng, S. J. Russell et al., “Algorithms for inverse reinforcement learning,” in Int. Conf. Machine Learning (ICML), 2000, pp. 663–670.
[4] P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse reinforcement learning,” in Int. Conf. Machine Learning (ICML), 2004.
[5] T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel, and J. Peters, “An algorithmic perspective on imitation learning,” Foundations and Trends in Robotics, vol. 7, no. 1-2, pp. 1–179, 2018.
[6] J. Choi and K.-E. Kim, “MAP inference for Bayesian inverse reinforcement learning,” in NIPS, 2011, pp. 1989–1997.
[7] D. P. Losey and M. K. O’Malley, “Including uncertainty when learning from human corrections,” in Conf. on Robot Learning (CoRL), 2018, pp. 123–132.
[8] D. Ramachandran and E. Amir, “Bayesian inverse reinforcement learning,” Urbana, vol. 51, no. 61801, pp. 1–4, 2007.