Multiperspective Teaching of Unknown Objects via Shared-gaze-based   Multimodal Human-Robot Interaction

Daniel Weber; Wolfgang Fuhl; Enkelejda Kasneci; Andreas Zell

arXiv:2303.00423·cs.HC·March 2, 2023

Multiperspective Teaching of Unknown Objects via Shared-gaze-based Multimodal Human-Robot Interaction

Daniel Weber, Wolfgang Fuhl, Enkelejda Kasneci, Andreas Zell

PDF

Open Access 1 Repo

TL;DR

This paper introduces a multimodal human-robot interaction method using shared gaze and augmented reality to teach robots new objects, enabling quick learning from few examples and detection of previously unknown classes.

Contribution

It presents a novel pipeline combining gaze and augmented reality for human-guided robot object teaching, with a publicly available dataset for further research.

Findings

01

Robots can learn new objects from few instances using the proposed method.

02

The approach enables detection of classes without prior training data.

03

Different machine learning models and training data sizes influence detection performance.

Abstract

For successful deployment of robots in multifaceted situations, an understanding of the robot for its environment is indispensable. With advancing performance of state-of-the-art object detectors, the capability of robots to detect objects within their interaction domain is also enhancing. However, it binds the robot to a few trained classes and prevents it from adapting to unfamiliar surroundings beyond predefined scenarios. In such scenarios, humans could assist robots amidst the overwhelming number of interaction entities and impart the requisite expertise by acting as teachers. We propose a novel pipeline that effectively harnesses human gaze and augmented reality in a human-robot collaboration context to teach a robot novel objects in its surrounding environment. By intertwining gaze (to guide the robot's attention to an object of interest) with augmented reality (to convey the…

Tables3

Table 1. Table 1: Comparison of all machine learning models trained in a transfer learning fashion. The best values are highlighted in bold.

Model	Backbone	mAP	${mAP}_{50}$	${mAP}_{75}$	${mAR}_{1}$	${mAR}_{10}$	${mAR}_{100}$
Faster R-CNN [36]	ResNet-50 [11]	33.6	66.9	31.4	43.7	50.1	50.4
Faster R-CNN [36]	MobileNetV3 [38, 12]	15.5	38.1	06.1	23.7	27.4	27.7
Faster R-CNN [36]	MobileNetV3 [38, 12]^$†$	13.0	38.4	03.1	22.2	25.5	25.5
FCOS [42]	ResNet-50 [11]	30.6	47.6	35.9	44.7	53.8	55.0
RetinaNet [22]	ResNet-50 [11]	31.2	52.4	34.3	46.2	57.6	59.1
SSD300 [24]	VGG16 [40]	08.0	19.1	05.0	21.0	31.6	34.0

Table 2. Table 2: Comparison of the average precision ( AP ) for different training types of Faster R-CNN on our test set. Namely, apart from the backbone, trained from scratch (S) or trained in the sense of transfer learning with frozen non-classification layers (TL-F) or completely unfrozen (TL-U), respectively. All three on the data collected by the robot. The best values are highlighted in bold. The last column (COCO) serves as an orientation and reports the results of Faster R-CNN trained on the entire MS COCO training set.

Class	S	TL-U	TL-F	COCO
fork	21.7	21.7	18.6	63.8
frisbee	19.5	47.6	58.8	65.9
gamepad	38.4	24.8	62.6	-
hole puncher	42.5	26.3	23.0	-
knife	20.4	19.1	27.6	50.0
scissors	06.5	07.3	05.1	70.9
shuttlecock	24.9	27.8	51.5	-
stapler	29.3	24.8	38.9	-
table tennis ball	03.4	27.6	44.0	17.3
toothbrush	21.6	8.8	6.2	37.4
${mAP}_{50}$	62.8	68.8	66.9	84.4
${mAP}_{75}$	09.8	7.5	31.4	55.4
${mAP}_{}$	22.8	23.6	33.6	50.9

Table 3. Table 3: Results of Faster R-CNN trained via TL-F on different sized subsets of our dataset. The best values are highlighted in bold.

	25%	50%	75%	100%
${mAP}_{}$	31.5	32.9	31.7	33.6
${mAP}_{50}$	64.7	66.7	66.7	66.9
${mAP}_{75}$	28.6	28.4	26.0	31.4
${mAR}_{1}$	41.1	42.9	41.4	43.7
${mAR}_{10}$	46.9	49.9	46.9	50.1
${mAR}_{100}$	47.2	50.2	47.1	50.4

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

dnlwbr/multiperspective-teaching
noneOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsRobotics and Sensor-Based Localization · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Full text

Multiperspective Teaching of Unknown Objects via Shared-gaze-based Multimodal Human-Robot Interaction

Daniel Weber

University of Tübingen

[email protected]

Wolfgang Fuhl

University of Tübingen

[email protected]

Enkelejda Kasneci

Technical University of Munich

[email protected]

Andreas Zell

University of Tübingen

[email protected]

Abstract

For successful deployment of robots in multifaceted situations, an understanding of the robot for its environment is indispensable. With advancing performance of state-of-the-art object detectors, the capability of robots to detect objects within their interaction domain is also enhancing. However, it binds the robot to a few trained classes and prevents it from adapting to unfamiliar surroundings beyond predefined scenarios. In such scenarios, humans could assist robots amidst the overwhelming number of interaction entities and impart the requisite expertise by acting as teachers. We propose a novel pipeline that effectively harnesses human gaze and augmented reality in a human-robot collaboration context to teach a robot novel objects in its surrounding environment. By intertwining gaze (to guide the robot’s attention to an object of interest) with augmented reality (to convey the respective class information) we enable the robot to quickly acquire a significant amount of automatically labeled training data on its own. Training in a transfer learning fashion, we demonstrate the robot’s capability to detect recently learned objects and evaluate the influence of different machine learning models and learning procedures as well as the amount of training data involved. Our multimodal approach proves to be an efficient and natural way to teach the robot novel objects based on a few instances and allows it to detect classes for which no training dataset is available. In addition, we make our dataset publicly available to the research community, which consists of RGB and depth data, intrinsic and extrinsic camera parameters, along with regions of interest.

1 Introduction

As technology progressed, more and more robots were developed for the industrial sector, and their fields of application became diversified. Numerous industries, including automotive, electronics, rubber and plastics, cosmetics, pharmaceutical, and food and beverage, benefit from their superior precision, efficiency, working capacity and tolerance to arduous and hazardous environments [52]. In the immediate environment of humans, robots are also increasingly employed in the form of service assistants, for instance in supermarkets such as Walmart [2, 19] or as tour guides in museums [41, 46]. This success has also been fueled by recent advances in machine learning, particularly in computer vision, which allows robots to understand their environment and detect objects and people within it. However, a core assumption is almost always that a large amount of training data with high quality labels exist. Many large car manufacturers or companies such as Google, Tesla, and Uber have therefore established their own image and video databases, in most cases by outsourcing to crowdsourcing platforms such as Amazon Mechanical Turk [39]. Regarding service robots operating in warehouses or office environments, there are often no publicly accessible datasets that are tailored to the respective environment, comprise all relevant object classes, and are fully labeled. Consequently, state-of-the-art object detectors perform excellently given the existence of sufficient training data, but are limited to deployment in previously specified scenarios predefined by the training data [51]. As soon as an object is not included, it exceeds the capabilities of the object detector and pushes the robot to the limits of its possibilities. In fact, the proportion of objects covered by data sets is vanishingly small compared to the quantity of objects existing in practice. For example, ImageNet [6], one of the largest publicly available image datasets, contains just 1000 classes, while the number of classes existing in the real world obviously far exceeds this number. This fact hinders the deployment of robots in unknown environments and the dynamic adaptation to unfamiliar conditions.

In this work, we aim to make mobile robots not only more capable in terms of the tasks they have to accomplish, but also alleviate data dependency, in the sense that we extend the robot’s basic knowledge by building on an existing general understanding of objects and adding new classes. More specifically, we teach a robot novel, unknown objects to enable it to redetect said objects within the environment in which the learning process took place. To this end, we employ fluent and intelligent human-robot interaction, at the intersection of research fields of computer vision, eye tracking, and augmented reality (AR). By means of the latter, we realize a multimodal communication channel, using human gaze to direct attention to an object, and speech or gestures to convey the relevant class information to the robot. Subsequently, the robot visually segments the object of interest and takes a series of images of the object from slightly different angles. The data obtained in this user-friendly and convenient process is rich in information and encompasses extrinsic and intrinsic camera parameters, as well as regions of interest, in addition to RGB and depth images. In conjunction with the class information provided by the human, the robot learns the respective object accordingly.

In summary, our main contributions are as follows:

We propose a novel pipeline to teach a robot new, yet unknown objects. 2. 2.

Towards this goal, we combine gaze and augmented reality in a human-robot interaction scenario to enable a feasible and swift acquisition of large amounts of labeled training data. 3. 3.

We evaluate the learning process in detail with multiple models, different learning methods, and with varying amounts of data. 4. 4.

We present Objects in Multiperspective Detail (OMD), a versatile dataset, and make it publicly available to the research community under https://cloud.cs.uni-tuebingen.de/index.php/s/2oRPs2o3FZkdBHW. 5. 5.

We make our system with our entire code base publicly available to the research community under https://github.com/dnlwbr/Multiperspective-Teaching.

2 Related Work

Engaging multimodal human-robot interaction to teach a robot unknown objects requires significant multidisciplinary efforts, which we will discuss in this section. From

augmented reality in robotics, and
previous applications of eye tracking in the context of computer vision, to
unknown object detection, to
how robots learn and
collaborate with humans.

2.1 Augmented Reality in Robotics

With the increasing popularity of augmented reality devices, industrial applications and research also expanded [25]. Especially the combination of AR, with robotic-assisted surgeries showed potential [32], as the human remains in control via AR, but can take advantage of the precision and consistency of the robots [44]. In terms of robot control and path planning, [18] controlled an industrial robot arm in pick-and-place tasks using an AR device and [34] designed an AR interface to plan, preview and execute the trajectory of a robot arm. In multi-agent systems, AR has also been used for visual feedback [47] and remote control [35] of robot swarms. Enhancing the perception of the real world through AR has thus proven to be an appealing way to communicate with robots.

2.2 Eye Tracking and Computer Vision

Eye tracking has also become an important tool in both research and industry [16]. For this reason, [16] developed an eye tracking software that works on mobile devices such as mobile phones or tablets and does not require any sensors other than a camera. The authors of [31], analyzed the mobile 3D eye tracking data using computer vision (and augmented reality). This involved tracking 3D markers and aligning them with virtual proxies. In [43], the class of objects was identified by matching the features of known objects with features in the neighborhood of human fixations. However, this required the object to appear in the database previously created specifically for this purpose, and it was also not possible to make statements about the position of the objects. Nevertheless, eye tracking data can help to improve the performance of segmentation algorithms [28]. Thus, in [30], fixations were used to train a model to annotate object locations, and in [50] gaze was incorporated into the segmentation process of a point cloud. Here in turn, in both cases it was not possible to make a statement about the object classes.

2.3 Unknown Object Detection

The problem of unknown object detection was also addressed using eye tracking in [51]. In this work, the number of candidate bounding boxes of a region proposal method was significantly reduced, but a classification was not possible. Using heat maps instead of scene frames, [49] categorized video segments based on whether a person was looking at an object and determined the parameters of the associated bounding box. The detected objects were all unknown, but again the classes were not determined. The authors of [13] addressed the problem of unknown object detection using a one-class support vector machine. Since the learning process was incremental, multiple robots were involved, connected to each other via a cloud-based station where all the processing took place. In this approach, unknown objects were only identified as unknown without adding the class information to the learning process. The purpose was to filter the unknown objects from the classified objects and forward only known objects in order to avoid sending incorrect information to the robots. In another recent work, [20] proposed to exploit additional predictions of semantic segmentation models and quantify the uncertainty of the proposed segmentations. Again, the classification task was binary and only the categories known and unknown were determined without eventually learning the objects.

2.4 Teaching of Robots and Machines

Robots are employed in a wide range of applications, especially in industry. These include teaching assembly tasks [48], where the robot learns from human demonstration: First the robot observes the human, then it imitates the human’s movements. A different way of learning through user interaction is proposed in [10]. In this approach, natural language is used to explain to the robot, which tasks it should perform. One such typical task is grasping objects. Solving this challenge is part of current research such as [14], where an object detection approach was used to learn good grasping poses. Data driven approaches, as stated in [3], are often addressed by providing training data in form of labeled examples, by trial-and-error, or through human demonstration. This means that communication between human and robot is also important here for a flexible learning progress without prior offline data generation.

2.5 Joint Attention in Collaborative Settings

Teaching in collaborative scenarios between human and robot has been investigated by [17] and [8], among others. In [17], natural language context for one-shot learning of visual objects has been used to enable a robot to recognize a described object. In this proof of concept, however, the objects had to be unambiguously distinguishable by color or spatial relationships, and the component parts also had to be uniquely describable by linguistic expressions. In [8], a teaching system for object categorization was proposed. This system allowed the user to visualize the intermediate states of categorization, that is, to which category the robot would assign an object. Through interaction, the categorization could be improved and corrected, but all objects had to be marked with AR markers in order to be recognized at all. In combination with picking tasks, [45] and [5] also taught a robot new objects. In both cases, however, exactly the same objects of each class were used for both training and testing, which positively biases the results. We use several different objects per class, which is more in line with real-world conditions.

In this work, we combine the different research areas of eye tracking and AR for a simple and natural interaction between robot and human to jointly solve a computer vision problem.

3 Method

The goal of this work is to teach a robot unknown objects in its environment, in such a way that the robot is later capable of detecting these objects in this same environment. To this end, we will explain below 1) how we communicate with the robot through the modalities vision, gaze, speech, and gestures 2) how the robot identifies the object of interest, 3) how the human teaches the robot, and 4) how the robot eventually manages to learn.

3.1 Augmented Reality Interface

In order to enable the human to teach the robot anything, a communication channel is mandatory. As suggested by [50], we meet this need in the form of an augmented reality interface. The entire communication between human and robot takes place via this interface, that is, the robot can be controlled, the gaze information of the human can be transmitted and the respective class information of the objects can be conveyed. On the human side, we deploy the HoloLens 2, which is a pair of head-mounted augmented reality glasses manufactured by Microsoft with a built-in eye tracker. We developed the interface, i.e. the HoloLens application, using the game engine and real-time development platform Unity, version 2019.4.36. In addition, we used assets from the Microsoft Reality Toolkit, MRTK 2.7.2, which Microsoft supplies specifically for this purpose. The data interconnection between the Universal Windows Platform (UWP) app on the HoloLens and the robot operating system, ROS [33], takes place via ROS# [1]. The open-source software library ROS# exchanges JSON based commands with ROS through the rosbridge_suite from within Unity applications. Both the HoloLens and the robot are continuously connected with each other via WiFi, and data, such as the human’s gaze information, can be sent and received in real time. Finally, gestures and speech serve to operate the AR interface and to interact with the robot. Figure 1 illustrates how the AR interface acts as a bridge between human and robot and shows the interplay of the individual components of our teaching pipeline, which we will describe in more detail below.

3.2 Identifying the Unknown Object of Interest

In order for the robot to learn a new object, it has to identify it as such in the first place. This is quite a fundamental problem, as it is, in a sense, a chicken-and-egg problem. For the robot, it is difficult to detect the object of interest as it does not know it at this point and it is yet to be taught. Therefore, since the robot must identify the object before it has learned it, the deployment of neural networks is not possible at this point, and determining where the object begins and where it ends is not trivial. Instead, we want to incorporate the human’s gaze information to help the robot locate the target object. This means that the human looks at the object, whereupon the robot can distinguish it from the rest of the environment. For this purpose, we take the approach of [50] as a basis, who segmented observed objects using human gaze and the point cloud obtained from the depth sensor of the robot’s scene camera. Thereby, a calibration determines the respective position of the robot and HoloLens, and the HoloLens’ motion sensors ensure that the mutual position is tracked during human movements. In addition, the gaze point, that is, the point at which the human is looking, is continuously tracked by the HoloLens and published as a point in 3D space through our user interface using ROS#. Thus, the corresponding ROS topic can be subscribed by the ROS system of the robot, which means that the gaze point is known to the robot at all times and can be used for segmentation. In the first instance of the segmentation described in [50] a pass through filter and a voxel grid filter are applied to reduce the size of the point cloud. Subsequently, the ground is extracted using RANSAC and eventually the object is isolated by means of the gaze point and Euclidean clustering. We adapt this method with slight adjustments in the last step. Instead of assigning only the cluster closest to the gaze point to the object, we consider all clusters within a certain distance and with a certain size. We set the threshold for the maximum distance between cluster and gaze point to 2 cm and the minimum cluster size to five points. This way it is possible to even segment very flat objects that do not protrude far from the ground. Depending on which object the person is currently looking at, the respective object is then segmented in real time. All the mentioned point cloud processing is accomplished using the open-source Point Cloud Library (PCL) [37].

Owing to the calibration carried out at the beginning, the position of the object of interest is known both from the location of the human wearing the HoloLens as well as from the location of the robot. The former allows the robot to display its feedback regarding the segmented object as a 3D bounding box on the HoloLens, namely in the human’s field of view, using a subscriber, attached to a virtual bounding box, that updates the position of the box depending on the segmentation results. We can take advantage of the latter during the teaching process described below.

3.3 Teaching through Joint Attention

The attention of the robot and the human is now jointly directed at one and the same object. The next step is to confirm to the robot that the framed object is the object of interest and to provide a class information. By performing a pinching gesture with the index finger and thumb (within the field of view of the HoloLens) during the fixation of the object with the human eyes, we can select the object via the AR interface. Alternatively, it is also possible to just say “select”. Thereupon, a virtual keyboard appears in the human’s field of vision, on which he can now enter the class name of the object. Again, it is alternatively possible to simply resort to speech. Figure 2 shows the implemented keyboard from the human perspective.

Our next goal is to have the robot autonomously capture images of the object of interest, which it can later use as training data. In order to get as many images from multiple angles as possible, we attach a second camera to the wrist of the robot’s arm. The robot is now supposed to move this camera in a circle around the object. This means that once the human has transmitted the class name to the robot by means of a ROS action, it calculates a circular trajectory of reachable points. Due to physical limitations, such as the length of the arm, this is usually a partial segment of the circle. During the movement, the camera is aligned in such a way that it points at a 45 degree angle to the center of the previously determined 3D bounding box. As the distance between the center and the camera, we use twice the length of the diagonal of the bounding box with a minimum safety distance of twice the distance between the camera and the robot arm end effector. The recording process is illustrated in Figure 3.

By virtue of the calibration described in the previous section, the robot is not only aware of the position of the object of interest, but also capable of computing the transformation to all of its coordinate systems, namely the robot frames. This includes the camera on the robot’s arm. The essential aspect in this step is to continuously transform the point cloud of the segmented object into the coordinate system of the camera. By projecting the point cloud onto the 2D image plane of the camera, we can derive the region of interest from the boundary points. Thus, for each captured image, we can additionally store a 2D bounding box calculated in this way. Overall, the stored synchronized data are RGB and depth images as well as the regions of interest with the 2D bounding boxes. In addition, we also store the positions of the camera to the captured object. Eventually, the robot is able to automatically produce hundreds of labeled training images in a remarkably brief period of time. More precisely, teaching one object takes about one minute and yields about 300 images. The progress and completion of the recording process is in turn transmitted to the HoloLens via the ROS action, visualizing to the human when the robot is ready for a new object.

3.4 Transfer Learning

Following the teaching part, the learning process of the robot now ensues. To enable the robot to independently detect the previously seen objects in the future, it must use the information at its disposal in the form of the training data it has created itself. In other words, the robot, or rather its neural network based object detectors, will be trained on the RGB images obtained. This will be accomplished by means of transfer learning. Hence, we assume some prior awareness of objectness, since our method should be seen as an extension rather than a replacement for the training with common large datasets, such as ImageNet [6], PASCAL VOC [9] or MS COCO [23]. Such an approach is realistic, since most of the existing objects are not part of these datasets. We aim to extend this state of knowledge with our method. Consequently, we resort to state-of-the-art object detectors, such as Faster R-CNN [36] and FCOS [42]. Starting from one of these pretrained models respectively, we delete the last classification layers, and then reinitialize them with the appropriate number of output neurons for our use case. Finally, we retrain said layers with our data, freezing all other neurons from the preceding feature layers. By freezing the feature layers responsible for the general comprehension of objects and fine-tuning only the last few layers, we prevent overfitting [53, 55]. In fact, since we only retrain the classification heads of the models, even training on the robot itself is possible without relying on a high-performance GPU. Naturally, the training may take longer. In Section 5, we will evaluate the performance of the aforementioned models, among others.

4 Dataset: Objects in Multiperspective Detail

The method introduced in the previous section allowed us to create a new type of dataset. While we publish the validation and test set mainly for the sake of reproducibility of our results, we think that our training set might be especially interesting for further purposes. In the following, we would like to explain the training data in more detail and provide some statistics. We will elaborate on the validation and test set within the scope of our evaluation in Section 5.

Our training set is particularly characterized by the fact that it supplies many details about individual objects. While most object detection datasets often consist of many images with different objects, our data depicts the objects from many different angles. This is especially attractive for objects that appear different from the front and back, such as a gamepad.

The set consists of 3113 perspectives in total and contains the classes fork, frisbee, gamepad, hole puncher, knife, scissors, shuttlecock, stapler, table tennis ball and toothbrush. For each class, there are two different entities in the set, differing in color, shape, or both. Figure 4 shows some sample images and Figure 5 illustrates the distribution of the viewpoints.

The objects are each placed individually on a table and for every camera perspective multiple pieces of information are included. For each RGB image, alongside the region of interest, there is a corresponding depth image that is aligned to the color image. All RGB images and depth images have a resolution of $1920\times 1080$ . In addition, all camera poses are available. They are specified independently of the robot as a transformation, composed of translation vector and rotation quaternion, from the coordinate system of the camera to the one of the object. The last component is the meta-information about the camera with the intrinsic parameters of the camera calibration.

Altogether, we believe that the high information density in our data is also interesting for other research areas where camera positions are crucial, such as Neural Radiance Fields [27, 54, 21, 7]. There, either synthetic data must be used or, given real data, the camera positions (and depths) can only be roughly approximated via structure from motion. For this reason, we make our dataset, Objects in Multiperspective Detail (OMD), publicly available to the research community.

5 Evaluation

For all our experiments, alongside the aforementioned HoloLens 2 worn by the human, we employed a Scitos G5 from MetraLabs [26] as robot. The body camera through which the robot observes the scene and performs the segmentation is an Azure Kinect DK from Microsoft. The robot arm that was additionally installed on the Scitos is a Kinova Jaco2 [15] with 6 DoF. The camera attached to the wrist of the arm in order to take pictures of the objects is an Intel RealSense D435.

For training we use our own training set, which we have explained in detail in Section 4. Since the objective of this work is to teach the robot its environment, we also had to record a validation and test set located in the environment where the learning process took place. As mentioned earlier, alongside the training set, we will also publish the validation and test set to ensure the reproducibility of our results. The validation and test set consist of 1051 and 1410 regular images, respectively, which were manually labeled by hand using DarkLabel [4]. The classes represented therein are the same ten as in the training set. For each class, four distinct objects of the respective category were available. The validation set contains the same two objects of each class as the training set, whereas the test set contains the other two. That is, the objects differ in shape, color, or both from those used in training. The objects were photographed randomly grouped (within the set) in the robot’s office environment. We made sure to create challenging scenarios as well, such as items being stacked or the toothbrush still being in its packaging, as shown in Figure 6.

In the following, we evaluate our learning pipeline using several state-of-the-art object detectors. Consequently, the object detectors Faster R-CNN [36], RetinaNet [22], FCOS [42], and SSD300 [24] serve as a foundation. We complement these with various backbones, such as ResNet-50-FPN [11], VGG16 [40] and MobileNetV3 Large [38, 12]. All backbones were trained on ImageNet and can be left as is, since we deliberately picked object classes for our evaluation that had no intersection at all with this dataset. The reason behind our choice of the ten test objects was as follows. On the one hand, they must not appear in ImageNet due to the backbones, but on the other hand, at least a part of them ought to be in MS COCO so that we have a comparison later on. Furthermore, within each class there had to be several different looking objects of that class. All of this together limited the selection accordingly, especially since we tried to avoid perishable classes like food. Hence, in terms of the actual object detectors and to ensure that our objects are indeed unknown to the models, we had to train them on a subset of MS COCO. More precisely, we extracted the classes fork, frisbee, knife, scissors, sports ball, and toothbrush from the dataset using the tool Fiftyone [29] and then trained the above mentioned detectors on the remaining part. In doing so, we followed the respective training recipe of the original implementation and, for consistency, adhered thereto in all of our subsequent experiments in our own training pipeline. The only exception in our transfer learning approach was the type of data augmentation applied. In this case, we used random photometric distortion, random zoom out, random cropping, and random horizontal flipping for all models (not just SSD300) to prevent overfitting. Subsequently, we trained using the method described in Section 3.

In all our experiments, we evaluate according to the MS COCO metric [23], namely the average precision for varying intersection-over-union thresholds (IoU). In this context, we use the abbreviations $\text{AP}=\text{AP}^{\text{IoU}=0.5:0.05:0.95}$ , $\text{AP}_{50}=\text{AP}^{\text{IoU}=0.5}$ , and $\text{AP}_{75}=\text{AP}^{\text{IoU}=0.75}$ within a class and mAP as the average over all categories. Analogously, this applies to the average recall, where we consider the maximum recall given 1, 10, and 100 detections per image, respectively, and use the abbreviations $\text{AR}_{1}=\text{AR}^{\text{max=1}}$ , $\text{AR}_{10}=\text{AR}^{\text{max=10}}$ and $\text{AR}_{100}=\text{AR}^{\text{max=100}}$ . Again, mAR denotes averaged over all categories. Unless otherwise stated, average precision and average recall refer to AP and AR, respectively.

A comparison of all tested models is provided in Table 1.

Faster R-CNN with the ResNet-50 backbone generally performed best in terms of average precision. With a MobileNetV3 backbone, the performance was significantly worse in terms of both precision and recall. FCOS and RetinaNet are slightly behind Faster R-CNN in terms of the mean average accuracy. The latter has the best recall values, while FCOS has the best mean average precision at an intersection-over-union of $0.75$ . SSD300 clearly lags behind all other models in terms of precision.

As we proceed, we will continue with Faster R-CNN for further analysis, since MS COCO [23] considers mAP as the single most important metric. In general, the model seems to detect the objects quite well, but some classes cause more difficulties than others. This also becomes apparent by looking at the curve in Figure 7.

Even at low recall values, the precision of the scissors is below $0.5$ . The class toothbrush also decreases early. On the other hand, the precision for the classes gamepad and frisbee is consistently excellent, even for a high recall.

In Table 2, we compare different training variants.

If we ignore the baseline variant (COCO) for a moment, we find that the transfer learning method, in which only the last layers had been trained (TL-F), has the best AP for the majority of classes. In particular, $\text{mAP}_{75}$ is significantly higher. It is worth mentioning that the baseline values (COCO) are naturally superior. The difference between the full MS COCO training set and the part we used for pretraining remains still $25\,713$ objects in $14\,296$ images, which is five times as many images as we used. In addition, our images are distributed among all ten classes, while MS COCO does not contain four of them and the baseline thus does not recognize them at all. This demonstrates the strength of our pipeline, which is designed to enable the learning of additional, as yet unknown classes, that is, to extend existing knowledge. The comparison with the baseline, which is the ideal case, namely 1) a suitable data set exists 2) it is accessible and 3) the object is part of it, serves primarily to better classify our results into the overall picture. It is not intended to outperform a model trained on such a large data set, but rather to determine an upper bound and test how close we can get with our method. Taking this into account, it is remarkable how well our pipeline has learned especially the classes gamepad or shuttlecock, whose AP is even higher than the mAP of the baseline. Furthermore, since the table tennis ball occurs in MS COCO only as a subset of the class sports ball, we can see how our system becomes more attentive to table tennis balls. In contrast, the class frisbee does not quite reach the baseline as the corresponding MS COCO class contains exclusively frisbees. In the case of the category hole puncher, the results are satisfying even without pre-existing basic knowledge (S).

Considering the amount of time needed for the respective training, major differences become apparent. Training of the entire MS COCO training set lasted the longest, at more than two days on two NVIDIA RTX A4000 deployed in parallel. The other three variants were trained on our dataset on only one of the GPUs and took 4 hours for the entire model (S, TL-U) and 2.5 hours for the freezed variant (TL-F), respectively. As mentioned above, although for consistency reasons we trained 26 epochs as in the original recipe, the weights with the best validation accuracy that we eventually used for testing were often reached earlier. For Faster R-CNN trained via TL-F, this was even the case after three epochs (starting at $\text{mAP}=0.0$ before training), which corresponds to a training time of about 40 minutes on our Scitos equipped with an NVIDIA GeForce GTX 1050 Ti. This makes the entire pipeline also suitable for stand-alone learning directly on the robot.

Finally, we analyze the influence of the amount of images used for training. Table 3 lists the results of Faster R-CNN trained using TL-F for varying dataset sizes.

The images were removed from the sequence of perspectives with equidistant spacing. Apart from the case where 75% of the data is used, the tendency emerges that as the number of images increases, so does the average precision and average recall. The best result is obtained using all the data.

6 Limitations & Discussion

Similar to all supervised machine learning methods, we depend on the quality of the training data. In our case, this can vary depending on the preceding segmentation of the point cloud. This in turn is naturally dependent on the quality of the data obtained by the depth sensor of the robot’s scene camera. Especially with very dark or glossy surfaces, we noticed that the depth sensor had problems determining the depth accurately. As a result, the accuracy of the bounding boxes suffers, which eventually has an impact on performance. However, this problem can be compensated with an even larger number of objects and our tests have shown, moreover, that the robot is still capable of detecting the learned objects in its environment despite such difficulties.

One further point is that while our approach generalizes well even to other objects of the learned classes, our tests were inferior on popular datasets such as MS COCO. This is due to the diversity of the images and the versatile situations depicted in them. For instance, fruits such as bananas and apples can be found in their natural form as well as cut into small pieces in a fruit salad. This task can only be solved with an enormous amount of training data. Our method, on the other hand, although it cannot achieve the performance of training on large datasets, is primarily designed to teach the robot objects for which data does not yet exist. In such scenarios, a semantic scene understanding is necessary and humans can assist in gaining this understanding by means of our method. While an extension of existing datasets would also be conceivable, our method, in contrast, does not require tremendous labeling resources and can be used spontaneously in the respective situation. Our evaluations (Table 2) show that our method is capable of learning unknown objects that are not detected by the baseline. It can therefore be used more flexibly without the need to know the situation in advance and rely on the existence of appropriate datasets. Moreover, teaching through two-way interaction is extremely natural, especially since the AR system enables real-time communication between human and robot by directly connecting both worlds, the analog world of the human and the digital world of the robot, so that the human can supply information (gaze, class) to the robot, thus initiating the recording process, and the robot can in turn communicate feedback visually. Pointing to objects using gaze completes the interplay, as it is intuitive, less ambiguous than pointing with a finger and, unlike speech, can be applied before the object is known to the robot. All in all, we therefore consider our approach to be less of a replacement and more of a supplement.

7 Conclusion

In this work, we presented a novel pipeline towards the deployment of robots in non-predefined scenarios. To this end, we leveraged human gaze and augmented reality in the interaction between robot and human to successfully teach the robot new, yet unknown objects in its environment.

In order for robots equipped with machine learning based object detectors to detect their environment and the objects contained therein, a lot of training data is usually required. In practice, however, under unpredictable conditions and due to the wide range of existing objects, the availability of a suitable data set can not always be guaranteed. Our approach can complement popular datasets in exactly such situations and produce large amounts of automatically labeled, non-synthetic training data in a user-friendly manner and in a short period of time. On the basis of such data, we have trained state-of-the-art object detectors in several different ways and shown that it is possible to learn and detect new objects in this manner. In fact, the training can even take place standalone on the robot due to transfer learning, without the need for tremendous computational resources. Further, with a few instances, it was possible for the robot to generalize to unseen objects in the given class and to detect classes that could not be detected by the baseline due to an unsuitable underlying training dataset. This makes our teaching pipeline a valuable extension to training exclusively on standard datasets. Overall, our approach is supremely natural and intuitive by virtue of its multimodality, including AR and the shared gaze of human and robot. The dataset we have recorded in the course of our evaluation is also made publicly available and is characterized by a high level of information density owing to the many different perspectives on the respective object and the data gathered in this process. As a result, it has the potential to be relevant for a variety of purposes, aside from ours.

However, a significant amount of work remains for the future as we plan to investigate the usability of our system in a user study as well as to extend our approach to enable the robot to successfully detect objects outside of its trained environment and to further leverage the acquired knowledge through active learning in another human-robot interaction scenario.

Acknowledgments

This research was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy – EXC number 2064/1 – Project number 390727645. We also thank Julia Dietl for her valuable efforts in labeling.

Bibliography56

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] Martin Bischoff. ROS#, June 2019.
2[2] Robert Bogue. Strong prospects for robots in retail. Industrial Robot: the international journal of robotics research and application , 2019.
3[3] Jeannette Bohg, Antonio Morales, Tamim Asfour, and Danica Kragic. Data-driven grasp synthesis—a survey. IEEE Transactions on Robotics , 30(2):289–309, 2013.
4[4] Dark Label. https://github.com/darkpgmr/Dark Label , 2021. Accessed: 2021-12-07.
5[5] Masood Dehghan, Zichen Zhang, Mennatullah Siam, Jun Jin, Laura Petrich, and Martin Jagersand. Online object and task learning via human robot interaction. In 2019 international conference on robotics and automation (ICRA) , pages 2132–2138. IEEE, 2019.
6[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , pages 248–255. IEEE, 2009.
7[7] Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised nerf: Fewer views and faster training for free. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 12882–12891, 2022.
8[8] Lotfi El Hafi, Hitoshi Nakamura, Akira Taniguchi, Yoshinobu Hagiwara, and Tadahiro Taniguchi. Teaching system for multimodal object categorization by human-robot interaction in mixed reality. In 2021 IEEE/SICE International Symposium on System Integration (SII) , pages 320–324. IEEE, 2021.