Can User-Centered Reinforcement Learning Allow a Robot to Attract Passersby without Causing Discomfort?

Yasunori Ozaki; Tatsuya Ishihara; Narimune Matsumura; Tadashi Nunobiki

arXiv:1903.05881·cs.AI·January 3, 2020

TL;DR

This study introduces a user-centered reinforcement learning approach enabling social robots to greet passersby effectively without causing discomfort, demonstrated through field experiments at an office entrance.

Contribution

The paper presents a novel reinforcement learning method tailored for social robots to adapt their greetings based on passersby reactions, reducing discomfort.

Findings

01

Robots using the method successfully avoided causing discomfort (p<0.01).

02

Field experiments confirmed the effectiveness of the approach in real-world settings.

03

The approach improved passersby's comfort and attention engagement.

Abstract

The aim of our study was to develop a method by which a social robot can greet passersby and get their attention without causing them to suffer discomfort. A number of customer services have recently come to be provided by social robots rather than people, including serving as receptionists, guides, and exhibitors. Robot exhibitors, for example, can explain products being promoted by the robot owners. However, a sudden greeting by a robot can startle passersby and cause them discomfort. Social robots should thus adapt their mannerisms to the situation they face regarding passersby. We developed a method for meeting this requirement on the basis of the results of related work. Our proposed method, user-centered reinforcement learning, enables robots to greet passersby and get their attention without causing them to suffer discomfort (p<0.01). The results of an experiment in the…

Tables

Table I: Action set in this experiment

Symbol	Detail
$a_{0}$	Robot waits for 5 secs until somebody comes.
$a_{1}$	Robot calls a passerby with a greeting.
$a_{2}$	Robot looks at a passerby.
$a_{3}$	Robot expresses joy through its motion.
$a_{4}$	Robot blinks its eyes.
$a_{5}$	Robot says “I’m sorry.” in Japanese.
$a_{6}$	Robot says “Excuse me.” in Japanese.
$a_{7}$	Robot says “It’s rainy today.” in Japanese.
$a_{8}$	Robot explains how to start its own service.
$a_{9}$	Robot says goodbye.

Table II: State set in this experiment

Symbol	Detail
$s_{00}$	The passerby’s state changes from “Not Found” into “Not Found”.
$s_{10}$	The passerby’s state changes from “Not Found” into “Passing By”.
⋮	⋮
$s_{56}$	The passerby’s state changes from “Leaving” into “Established”.
$s_{66}$	The passerby’s state changes from “Leaving” into “Leaving”.

Table III: Items of the result after the data cleansing

Item	Before	After	Total
Episodes	87	122	209
Time [h]	13.7	26.7	40.4
Days [d]	3	6	9

Equations

$T_{0}(s)$

$T_{n+1}(s)$

$$p(s,a)=\frac{\exp(Q(s,a)/T_{n}(s))}{\sum_{a_{i}\in A}\exp(Q(s,a_{i})/T_{n}(s))}$$

Taxonomy

Topics: Social Robot Interaction and HRI · Reinforcement Learning in Robotics · Robotic Path Planning Algorithms

Full text

Can User-Centered Reinforcement Learning Allow a Robot to Attract Passersby without Causing Discomfort?*

Yasunori Ozaki¹, Tatsuya Ishihara², Narimune Matsumura¹ and Tadashi Nunobiki¹

*This work is supported by NTT Corporation.

¹Yasunori Ozaki, Narimune Matsumura and Tadashi Nunobiki are with Service Evolution Lab., NTT Corporation, Yokosuka, Japan.

²Tatsuya Ishihara is with the R&D Center, NTT West Corporation, Osaka, Japan.

Abstract

The aim of our study was to develop a method by which a social robot can greet passersby and get their attention without causing them to suffer discomfort. A number of customer services have recently come to be provided by social robots rather than people, including serving as receptionists, guides, and exhibitors. Robot exhibitors, for example, can explain products being promoted by the robot owners. However, a sudden greeting by a robot can startle passersby and cause them discomfort. Social robots should thus adapt their mannerisms to the situation they face regarding passersby. We developed a method for meeting this requirement on the basis of the results of related work. Our proposed method, user-centered reinforcement learning, enables robots to greet passersby and get their attention without causing them to suffer discomfort ( $p<0.01$ ). The results of an experiment in the field, an office entrance, demonstrated that our method meets this requirement.

I Introduction

The working population in many developed countries is decreasing in proportion to the total population due to population aging, and this problem is expected to affect developing countries as well[1]. One approach to addressing this problem is to use social robots rather than people to provide customer services. Such robots, for example, are starting to be used as receptionists, guides, and exhibitors. Robot exhibitors are being used to provide, for example, exhibition services, such as explaining products being promoted by the robot owners. While robots can increase the chance of being able to provide a service by simply greeting passersby[2], passersby can suffer discomfort if they are suddenly greeted by a robot[3]. The robot may thus face a dilemma: whether to behave in a manner that benefits the owner or to behave in a manner that does not discomfort passersby.

Our goal was to develop a method that solves the robot dilemma described above. That is, a method by which a robot can greet passersby and get their attention without causing them to suffer discomfort. We call our proposed method user-centered reinforcement learning.

In the next section, we define the problem and describe how we found an approach to solving it by studying related work. In the “Proposed Method” section, we explain the method we developed for solving the problem. In the “Experiment” section, we explain the experiment we conducted in the field to test two working hypotheses created from the original hypothesis. The results show that our method can solve the problem. In the “Discussion” section, we examine the results from the standpoints of physiology, psychology, and user experience. In the “Conclusion” section, we conclude that, by using user-centered Q-learning, a robot can increase the chance of being able to provide a service to a passerby without causing the passerby discomfort. We also mention future work to enhance the proposed method.

I-A Related Works

Several researchers have addressed problems that are similar to the problem we addressed. These problems can be categorized in terms of the problem setting, the solution, and the goal.

In terms of the problem setting, the problem we addressed is similar to the problem of human-robot engagement, which is a complex problem. In accordance with human-robot interface studies[4, 5], we can interpret human-robot engagement as the process by which a robot interacts with people, from initial contact to the end of the interaction. Several researchers have analyzed human-robot engagement[6, 7] and have developed a method for maintaining human-robot engagement during the interaction [8]. We did not tackle the human-robot engagement problem directly; instead, we tackled the problem that precedes it, which is illustrated in Figure 2.

In terms of the solution, the problem we addressed is similar to machine learning, especially reinforcement learning. Reinforcement learning in robotics is a technique used to find a policy $\pi:O\rightarrow A$ [9] and is used for robotic control tasks. It is not used much for interaction tasks. Reinforcement learning has been applied to the learning of several complex aerobatic control tasks for radio-controlled helicopters [10] and to the learning of door opening tasks for robot arms [11]. The research on interaction tasks is less remarkable. Mitsunaga et al. showed that a social robot can adapt its behavior to humans for human-robot interaction by using reinforcement learning [12] if human-robot engagement has been established. Papaioannou et al. used reinforcement learning to extend the engagement time and enhance the dialogue quality [13].

The applicability of these methods to the situation before human-robot engagement is established is unclear. As shown in Figure 2, the problem we addressed occurs before engagement is established.

In terms of the goal, the problem we addressed is similar to increasing the number of human-robot engagements. Macharet et al. showed that, in a simulation environment, Gaussian process regression based on reinforcement learning can be used to increase the number of engagements[14]. Going further, we focused on increasing the number of engagements in a field environment.

I-B Problem Statement

We define the problem using the partially observable Markov decision process (POMDP), a problem framework commonly used for reinforcement learning in robotics[9]. The robot is the agent, and the environment is the problem. The robot can observe the environment partially by using sensors.

We chose an exhibition service area in the entrance of a company as the environment. We assume the entrance consists of one automated exhibition system, one aisle, and other space. In addition, the entrance is expressed as the Euclidean space $R^{3}$ . Passersby can move freely around the exhibition system.

The automated exhibition system consists of a tablet, a computer, a robot, and a sensor system. The sensor system can sense color image data $I_{t}$ and depth image data $D_{t}$ . We call these data the observation $O_{t}$ . The sensor system can also extract a partial passerby’s action from $O_{t}$ . The passerby’s action consists of the passerby’s position $\bm{p_{t}}=(x_{t},y_{t},z_{t})$ and the head angle $\bm{\theta_{t}}=(\theta_{t}^{yaw},\theta_{t}^{roll},\theta_{t}^{pitch})$ . We define $t=0$ as the time when the passerby enters the entrance and $t=T_{end}$ as the time when the passerby leaves the entrance. We call the interval between $t=0$ and $t=T_{end}$ an episode. Let $P=(\bm{p_{0}},...,\bm{p_{T_{end}}})$ be the passerby’s positions in an episode, and let $\Theta=(\bm{\theta_{0}},...,\bm{\theta_{T_{end}}})$ be the passerby’s head angles in the episode.

The robot selects its own action on the basis of these passerby actions.

Let $N_{u}$ be the number of people who used the service. Let $N_{d}$ be the number of people who suffered discomfort. Then, we can state the problem as “Find a robot’s policy $\pi:O\rightarrow A$ such that $\max(N_{u})$ and $\min(N_{d})$ ”.

I-C Our Approach

We solve this problem by controlling the robot on the basis of reinforcement learning, specifically ordinary Q-learning except for the design of the reward function. The reward function is created by focusing on the user experience of stakeholders. We call reinforcement learning with this reward function “user-centered reinforcement learning.” We do not use deep reinforcement learning because it is currently difficult to collect the huge amount of data needed for learning.

I-D Contributions

The contributions of this work are as follows:

1. We show that robots can learn abstract actions from a person’s non-verbal responses.

2. We present a method for increasing the number of human-robot engagements in the field without causing passersby to suffer discomfort.

II Proposed Method

The proposed method, User-Centered Reinforcement Learning, is based on reinforcement learning. In this paper, we use Q-learning, a type of reinforcement learning, as the base algorithm because Q-learning makes it easy to explain why the robot chose its past actions. We call this algorithm “User-Centered Q-Learning” (UCQL). UCQL differs from the original Q-learning[15] in the action set $A$ , the state set $S$ , the Q-function $Q(s,a)$ , and the reward function $r(s_{t},a_{t},s_{t+1})$ . UCQL consists of three functions:

1. Select an action by a policy.

2. Update the policy based on the user’s actions.

3. Design a reward function and a Q-function as the initial condition.

II-1 Selecting an action by a policy

Generally speaking, the robot senses an observation and takes an action, which may be to wait. Let $t_{a}[sec]$ be the time when the robot acted. Let $t_{c}[sec]$ be the time when the robot computes the algorithm. Let $s_{t}\in S$ be the predicted user’s state at time $t$ . Let $a_{t}\in A$ be the robot’s action at time $t$ . In UCQL, the robot chooses the action by Algorithm 1.

II-2 Update the policy based on user’s actions

In UCQL, the robot updates the policy by Algorithm 2.
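Algorithm 2 itself is not reproduced in this text, so the following is only a minimal sketch of the standard tabular Q-learning backup that UCQL is described as building on. The values of `ALPHA` and `GAMMA` are the ones reported in the experiment section; the reward of `1.0`, the dictionary-based table, and the state symbol `"s21"` are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict

ALPHA = 0.5    # learning rate reported in the experiment section
GAMMA = 0.999  # discount factor reported in the experiment section


def q_update(q, state, action, reward, next_state, actions,
             alpha=ALPHA, gamma=GAMMA):
    """Apply one standard Q-learning backup in place and return the new value."""
    # Value of the best action available in the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    # Move Q(s, a) a fraction alpha of the way toward the TD target.
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


actions = [f"a{i}" for i in range(10)]  # a0..a9 as in Table I
q = defaultdict(float)                  # every Q(s, a) starts at 0.0

# Hypothetical transition: greeting (a1) in s10 is followed by reward 1.0.
first = q_update(q, "s10", "a1", reward=1.0, next_state="s21", actions=actions)
# A later backup into s10 propagates part of that value one step back.
second = q_update(q, "s00", "a0", reward=0.0, next_state="s10", actions=actions)
```

With a zero-initialized table, the first backup moves `Q(s10, a1)` halfway toward the reward, and the second shows how that value is discounted and propagated to the preceding state.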

II-3 Designing a reward function

In UCQL, the robot is given a reward function by Algorithm 3. Algorithm 3 divides motivation into extrinsic and intrinsic motivation, inspired by “Intrinsically Motivated Reinforcement Learning”[16]. We call the proposed method “User-Centered” because we design the extrinsic motivation from the user’s states related to user experience.

II-4 Miscellaneous

• We can choose any policy $\pi$ , such as greedy or $\epsilon$ -greedy.

• The Q-function may be initialized with a uniform distribution. However, if the Q-function is designed to suit the task, learning is faster than with uniform initialization.

• The Q-function may be approximated with a function such as a Deep Q-Network[17]. However, learning is much slower than with the designed function.

III Experiment

In this section, we aim to test the hypothesis that “by using user-centered Q-learning, a robot can increase the chance of being able to provide a service to a passerby without causing the passerby discomfort”.

III-A Concrete Goal

First, we convert the hypothesis into a working hypothesis by operationalization because we cannot evaluate the original hypothesis quantitatively.

In the Introduction, we defined the problem as “Find a robot’s policy $\pi:O\rightarrow A$ such that $\max(N_{u})$ and $\min(N_{d})$ ”. We now give concrete shape to $N_{u}$ and $N_{d}$ for this experiment. Ozaki’s study[3] provides two important points. First, a passerby does not suffer a negative effect from the robot’s call if the passerby does not use the robot service. Second, a passerby suffers a negative effect from the robot’s call if the passerby uses the robot service. Thus, we have a binary classification problem: whether a passerby who is called by the robot uses the robot service or not. We can therefore define a confusion matrix for evaluating the method. We infer that $N_{u}$ is positively correlated with TP and TN. We also infer that $N_{d}$ is positively correlated with FP. On the other hand, we infer that $N_{d}$ is negatively correlated with TN. Therefore, we can use $\textrm{Accuracy}=(TP+TN)/(TP+FP+TN+FN)$ as an index for evaluation because $\max(\textrm{Accuracy})$ is another representation of “ $\max(N_{u})$ and $\min(N_{d})$ ”.

From the above discussion, we define the working hypothesis $WH$ as “The accuracy after learning by UCQL is better than the accuracy before learning by UCQL”.

In this experiment, we test $WH$ in order to show that the hypothesis is sound.

III-B Method

In this section, we explain how we conducted the experiment in a field environment. The method can be divided into five steps:

1. Create the experimental equipment.

2. Construct the experimental environment.

3. Define the experimental procedure.

4. Evaluate the working hypothesis by statistical hypothesis testing.

5. Visualize the effect of UCQL.

III-B1 Create the experimental equipment

First, we created equipment that includes UCQL. The equipment can be explained in terms of its physical structure and its logical structure.

Figure 3 is a diagram of the physical structure of the equipment. As shown in Figure 3, the experimental equipment consists of a table, a sensor, a robot, a tablet PC, a router, and servers. The components are connected by Ethernet cable or wireless LAN. We use Sota (https://sota.vstone.co.jp/home/), a palm-sized social humanoid robot, as the robot. Sota has a speaker to output voice, an LED to represent lip motion, an SoC to control these elements, and so on. In this experiment, these elements of Sota are used to interact with a participant. An iPad Air 2 is used as the tablet PC, on whose display the movie is started. The Intel RealSense Depth Camera D435 (https://click.intel.com/intelr-realsensetm-depth-camera-d435.html) is used as the RGB-D sensor device to measure the passerby’s actions.

Figure 4 is a diagram of the logical structure of the equipment. The structure consists of the Sensor, Motion Capture, State Estimator, Action Selector, Action Decoder, Effector, and Policy Updater. We use Nuitrack (https://nuitrack.com/) as the Motion Capture, and we use ROS (http://wiki.ros.org/) as the infrastructure of the equipment to communicate variables among functions.

As shown in Figures 3 and 4, the equipment works by Algorithm 4.

We use Table I as the action set $A$ and Table II as the state set $S$ . Table II is a double Markov model created from the state set of Ozaki’s decision-making predictor[3]. Ozaki’s decision-making predictor classifies a passerby’s state into seven states: Not Found ( $s_{0}$ ), Passing By ( $s_{1}$ ), Look At ( $s_{2}$ ), Hesitating ( $s_{3}$ ), Approaching ( $s_{4}$ ), Established ( $s_{5}$ ), Leaving ( $s_{6}$ ).
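As a rough illustration of how the pairwise state set can be built from the seven base states, the sketch below enumerates every transition symbol. The index order is inferred from the rows shown in Table II (the first index is the state changed into, the second is the state changed from); the function name and the dictionary layout are hypothetical.

```python
from itertools import product

# Base states s0..s6 of the decision-making predictor, in index order.
BASE_STATES = ["Not Found", "Passing By", "Look At", "Hesitating",
               "Approaching", "Established", "Leaving"]


def build_state_set(base_states=BASE_STATES):
    """Map each symbol s_{ij} to its (previous, current) pair of labels.

    Following the rows of Table II, i indexes the current state and j the
    previous one, e.g. s10 is the change from "Not Found" into "Passing By".
    """
    states = {}
    for cur, prev in product(range(len(base_states)), repeat=2):
        states[f"s{cur}{prev}"] = (base_states[prev], base_states[cur])
    return states


STATE_SET = build_state_set()
```

Under this reading the state set has 7 × 7 = 49 elements, which keeps a Q-table over 49 states and 10 actions small enough to inspect by hand.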

In addition, we use $\alpha=0.5$ and $\gamma=0.999$ as learning parameters. We use soft-max selection as the policy because we want the robot both to take actions that have a high value and to find actions that have an even higher value. Soft-max selection is often used for Q-learning. Equation 3 gives the probability of selecting each action under the policy. We use Equation 2 as the policy parameter. $T_{n}(s)$ denotes the temperature after it has been updated $n$ times in state $s$ . $T_{n}(s)$ depends on the state because $s_{00}$ occurs many times. We use $k_{T}=0.98$ and $T_{min}=0.01$ as learning parameters.

$$p(s,a)=\frac{\exp(Q(s,a)/T_{n}(s))}{\sum_{a_{i}\in A}\exp(Q(s,a_{i})/T_{n}(s))}\qquad(3)$$
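The soft-max selection described here can be sketched as follows. `softmax_probs` follows the probability $p(s,a)$ of Equation 3. The temperature update rule is not recoverable from this text, so `cool` is only an assumed schedule (geometric decay by $k_{T}$ with a floor at $T_{min}$) and may differ from the paper's Equation 2; all function names are hypothetical.

```python
import math
import random

K_T = 0.98    # k_T reported in the text
T_MIN = 0.01  # T_min reported in the text


def softmax_probs(q_values, temperature):
    """Return Boltzmann probabilities exp(Q/T) / sum_i exp(Q_i/T)."""
    scaled = [q / temperature for q in q_values]
    m = max(scaled)  # subtracting the max avoids overflow, result unchanged
    weights = [math.exp(x - m) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def select_action(q_values, temperature, rng=random):
    """Sample an action index according to the soft-max probabilities."""
    probs = softmax_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


def cool(temperature, k_t=K_T, t_min=T_MIN):
    """ASSUMED schedule (not from the paper): geometric decay with a floor."""
    return max(t_min, k_t * temperature)


probs = softmax_probs([0.0, 1.0], temperature=1.0)
greedy_like = select_action([0.0, 1.0], temperature=T_MIN,
                            rng=random.Random(0))
```

At a high temperature the two actions are chosen with comparable probability, while at the floor temperature the choice is effectively greedy. Keeping one temperature per state, as the text describes, lets frequently visited states such as $s_{00}$ cool down without freezing exploration in rarely visited ones.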

III-B2 Construct an experimental environment

First, we define how the environment for the experiment is constructed. Figure 5 shows an overhead view of the environment. The environment consists of an exhibition space, a wall, a seat space, and a way to a W.C. in a building owned by an actual company. Hundreds of employees work in the building, and dozens of visitors come to the building. Visitors often sit in the seat space for tens of minutes while waiting for employees. Some visitors and employees look at the exhibition space to learn about the company’s newer technologies. Some visitors go to the W.C. while they are waiting for employees.

III-B3 Define an experimental procedure

We suppose two main scenarios. The first scenario is as follows:

1. A visitor is sitting on a seat in the seat space.

2. Then, the visitor gets up from the seat because the visitor wants to go to the W.C.

3. Thus, the visitor moves from the seat space to the W.C. across the exhibition space.

The second scenario is as follows:

1. A visitor is sitting on a seat in the seat space.

2. Then, the visitor gets up from the seat because the visitor is bored of waiting.

3. Thus, the visitor moves from the seat space to the exhibition space in order to watch the robot in the equipment.

We mainly want to attract passersby in the second scenario. We do not want to attract passersby in the first scenario because the visitor wants to go to the W.C. Therefore, because we want the robot to learn these rules, we let the robot learn them in the environment by UCQL for several days. We then obtain the learned Q-function $Q_{A}(s,a)$ .

After the learning, we let the robot attract passersby under two conditions, Before Learning and After Learning, because we want to test the hypothesis. The robot does not learn during the test.

We collect the data for the evaluation with rosbag (http://wiki.ros.org/rosbag). With rosbag, we can record all of the values in ROS during the procedure.

III-B4 Evaluate by statistical hypothesis testing

We evaluate the working hypothesis $WH$ by statistical hypothesis testing. We calculate the accuracy before the learning and the accuracy after the learning in order to test $WH$ . We then use the one-sided test of proportions because we want to evaluate the statistical difference between the accuracy before the learning and the accuracy after the learning.

III-B5 Visualize the effect of UCQL

We visualize the Q-function before the learning and the Q-function after the learning as heat maps in order to analyze the effect of UCQL. UCQL changes the robot’s actions by updating the Q-function. Therefore, we can see how the robot learned its actions by visualizing the Q-function. Figure 6 is an example Q-function that explains the visualization used in this paper.

IV Result

We constructed the experimental environment described in the Method section in the entrance of our building. Figure 1 shows a picture of the equipment in the environment. The experimenter was the corresponding author. The participants were employees and visitors of our company. The learning interval was three days. As a result, we measured a large amount of data. We cleaned the data by the following steps because data from the field contain a lot of noise, such as detection errors by the Motion Capture.

• We dropped episodes whose duration was less than 1 [sec] because it takes 3 [sec] to walk across the detection area of the Motion Capture.

• We dropped episodes that consisted only of transitions from $s_{00}$ to $s_{00}$ because nobody was in the detection area of the Motion Capture.
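The two cleansing rules above can be sketched as a simple filter. The episode representation (a dict with `"duration"` in seconds and `"states"` as a list of state symbols) is a hypothetical layout chosen for illustration; the recorded rosbag data would need to be converted into such a form first.

```python
MIN_DURATION_SEC = 1.0  # threshold from the first cleansing rule
IDLE_STATE = "s00"      # "Not Found" into "Not Found"


def clean_episodes(episodes):
    """Keep episodes that are long enough and contain a detected passerby.

    Each episode is a dict with hypothetical keys:
      "duration": length in seconds, "states": list of state symbols.
    """
    kept = []
    for ep in episodes:
        too_short = ep["duration"] < MIN_DURATION_SEC
        only_idle = all(s == IDLE_STATE for s in ep["states"])
        if not (too_short or only_idle):
            kept.append(ep)
    return kept


raw = [
    {"duration": 0.4, "states": ["s00", "s10"]},         # dropped: too short
    {"duration": 6.0, "states": ["s00", "s00", "s00"]},  # dropped: nobody seen
    {"duration": 5.2, "states": ["s00", "s10", "s11"]},  # kept
]
cleaned = clean_episodes(raw)
```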

We obtained 209 episodes in total after the data cleansing. Table III shows the number of episodes and the time for each condition. We calculated the accuracy from the confusion matrix for each condition. The confusion matrices for the before condition and the after condition were respectively $(\textrm{TP,FP,FN,TN})=(11,59,0,17)$ and $(\textrm{TP,FP,FN,TN})=(7,23,0,92)$ . Therefore, the accuracies of the baseline (before learning) and the proposed method (after learning) were respectively 0.322 and 0.811. In testing $WH$ by the one-sided test of proportions, we found a significant difference in accuracy between the before and after conditions ( $p=4.46\times 10^{-13}<0.01$ ).
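The reported accuracies can be recomputed directly from the two confusion matrices. The text does not say which variant of the test of proportions was used, so the sketch below assumes a pooled two-proportion z-test without continuity correction; the function names are hypothetical.

```python
import math


def accuracy(tp, fp, fn, tn):
    """Accuracy = (TP + TN) / (TP + FP + TN + FN)."""
    return (tp + tn) / (tp + fp + fn + tn)


def one_sided_proportion_test(successes_a, n_a, successes_b, n_b):
    """Pooled z-test of H1: proportion_b > proportion_a. Returns (z, p)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_b - p_a) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of N(0, 1)
    return z, p_value


# Confusion matrices reported for the before and after conditions.
before = dict(tp=11, fp=59, fn=0, tn=17)
after = dict(tp=7, fp=23, fn=0, tn=92)

acc_before = accuracy(**before)
acc_after = accuracy(**after)
z, p_value = one_sided_proportion_test(
    before["tp"] + before["tn"], sum(before.values()),
    after["tp"] + after["tn"], sum(after.values()))
```

The episode counts implied by the matrices (87 and 122) match Table III, and under the stated assumption the p-value falls far below the 0.01 threshold.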

V Discussion

We discuss the original hypothesis, “The robot can attract passersby without users’ discomfort by User-Centered Reinforcement Learning,” from the following points of view:

1. Can we accept the original hypothesis?

2. Why can the robot attract passersby without discomfort by the proposed method?

3. What are the limitations of the method and the experiment?

V-A Can we accept the original hypothesis?

We explain why we can accept the original hypothesis by using the result of the experiment and another study.

First, we show that we can accept $WH$ , “The accuracy after learning by UCQL is better than the accuracy before learning by UCQL”. According to Section IV, we found a significant difference in accuracy between the before and after conditions. Thus, we accept $WH$ and can infer that $WH$ is true.

The result of the experiment supports the original hypothesis through the above discussion because the working hypothesis is true. Therefore, we can accept the original hypothesis.

V-B Why can the robot attract passersby without discomfort by the proposed method?

We can explain why the robot attracts passersby without discomfort in terms of the learning process by using Figure 8.

Why does the robot reduce FN by UCQL? We compare the row of $s_{01}$ in Figure 8(a) with the row of $s_{01}$ in Figure 8(b). Before learning, the robot selected action $a_{4}$ because $\operatorname*{arg\,max}_{a}Q_{B}(s_{01},a)=a_{4}$ . After learning, the robot selected action $a_{0}$ because $\operatorname*{arg\,max}_{a}Q_{A}(s_{01},a)=a_{0}$ . This means that the robot does not call a passerby who will not use the robot service. Therefore, the robot reduces FN by UCQL.

V-C What are the limitations of the method and the experiment?

In this experiment, we assumed that a passerby does not walk with others. In other words, we did not consider groups of passersby. Thus, we need to extend the method in order to handle groups.

The data in this study were sampled from a biased population. We need to conduct further experiments in other environments if we want to establish the soundness of the working hypothesis more firmly.

In this experiment, we created the reward function on the basis of other studies. However, it is hard to create a reward function for each case. Therefore, we have to create an easy method for designing the reward function and the Q-function.

VI Conclusion

We investigated the hypothesis that “by using user-centered Q-learning, a robot can increase the chance of being able to provide a service to a passerby without causing the passerby discomfort.” We proposed a method based on reinforcement learning in robotics and focused on the reward function and the Q-function because we wanted the robot to perform actions in view of user experience. To investigate our hypothesis, we made a working hypothesis and tested it experimentally. From the results, we accepted the working hypothesis and the original hypothesis.

Future work includes generalizing the method for creating the reward function to make it applicable to different tasks and developing a distributed reinforcement learning method that enhances time-efficiency.

Bibliography

[1] United Nations, “World population ageing 2015,” 2015.
[2] C. Shi, S. Satake, T. Kanda, and H. Ishiguro, “Field trial for social robots that invite visitors to stores,” Journal of the Robotics Society of Japan, vol. 35, no. 4, pp. 334–345, 2017.
[3] Y. Ozaki, T. Ishihara, N. Matsumura, T. Nunobiki, and T. Yamada, “Decision-making prediction for human-robot engagement between pedestrian and robot receptionist,” in 2018 27th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), Aug 2018, pp. 208–215.
[4] C. L. Sidner, C. D. Kidd, C. Lee, and N. Lesh, “Where to look: a study of human-robot engagement,” in Proceedings of the IUI 2004.
[5] M. Sun, Z. Zhao, and X. Ma, “Sensing and handling engagement dynamics in human-robot interaction involving peripheral computing devices,” in Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, ser. CHI ’17. ACM, 2017, pp. 556–567.
[6] C. L. Sidner and C. Lee, “Engagement rules for human-robot collaborative interactions,” in SMC’03 Conference Proceedings. 2003 IEEE International Conference on Systems, Man and Cybernetics. Conference Theme - System Security and Assurance (Cat. No.03CH37483), vol. 4, Oct 2003, pp. 3957–3962.
[7] C. Rich, B. Ponsler, A. Holroyd, and C. L. Sidner, “Recognizing engagement in human-robot interaction,” in 2010 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI), March 2010, pp. 375–382.
[8] D. Bohus, C. W. Saw, and E. Horvitz, “Directions robot: In-the-wild experiences and lessons learned,” in Proceedings of the 2014 International Conference on Autonomous Agents and Multi-agent Systems, ser. AAMAS ’14. Richland, SC: International Foundation for Autonomous Agents and Multiagent Systems, 2014, pp. 637–644.