Sequence-based Multimodal Apprenticeship Learning For Robot Perception and Decision Making
Fei Han, Xue Yang, Yu Zhang, Hao Zhang

TL;DR
This paper introduces SMAL, a novel sequence-based multimodal apprenticeship learning method that fuses temporal and multimodal data to improve robot perception and decision-making in complex scenarios.
Contribution
The paper presents a new approach that integrates temporal and multimodal data for apprenticeship learning, enhancing robot perception and decision-making capabilities.
Findings
SMAL effectively learns robot plans from multimodal observation sequences.
SMAL outperforms baseline methods using individual images.
Validated in both simulation and real-world rescue scenarios.
Abstract
Apprenticeship learning has recently attracted a wide attention due to its capability of allowing robots to learn physical tasks directly from demonstrations provided by human experts. Most previous techniques assumed that the state space is known a priori or employed simple state representations that usually suffer from perceptual aliasing. Different from previous research, we propose a novel approach named Sequence-based Multimodal Apprenticeship Learning (SMAL), which is capable to simultaneously fusing temporal information and multimodal data, and to integrate robot perception with decision making. To evaluate the SMAL approach, experiments are performed using both simulations and real-world robots in the challenging search and rescue scenarios. The empirical study has validated that our SMAL approach can effectively learn plans for robots to make decisions using sequence of…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsRobotics and Sensor-Based Localization · Domain Adaptation and Few-Shot Learning · Robot Manipulation and Learning
Sequence-based Multimodal Apprenticeship Learning For
Robot Perception and Decision Making
Fei Han1, Xue Yang1, Yu Zhang2, and Hao Zhang1 1Fei Han, Xue Yang and Hao Zhang are with the Department of Computer Science, Colorado School of Mines, 1500 Illinois Street, Golden, CO 80401, USA [email protected], [email protected], [email protected]2Yu Zhang with the Department of Computer Science and Engineering, Arizona State University, 699 S Mill Ave, Tempe, AZ 85281, USA [email protected]
Abstract
Apprenticeship learning has recently attracted a wide attention due to its capability of allowing robots to learn physical tasks directly from demonstrations provided by human experts. Most previous techniques assumed that the state space is known a priori or employed simple state representations that usually suffer from perceptual aliasing. Different from previous research, we propose a novel approach named Sequence-based Multimodal Apprenticeship Learning (SMAL), which is capable to simultaneously fusing temporal information and multimodal data, and to integrate robot perception with decision making. To evaluate the SMAL approach, experiments are performed using both simulations and real-world robots in the challenging search and rescue scenarios. The empirical study has validated that our SMAL approach can effectively learn plans for robots to make decisions using sequence of multimodal observations. Experimental results have also showed that SMAL outperforms the baseline methods using individual images.
I Introduction
Apprenticeship learning (AL) has become an active research area in robotics over the past years, which enables a robot to learn physical tasks from expert demonstrations, without the requirement to engineer accurate task execution models. AL has been widely applied in a variety of practical applications, including object grasping [1], robotic assembly [2], helicopter control [3], navigation and obstacle avoidance [4], among others [5, 6, 7, 8]. AL methods automatically learn a mapping from world states to robot actions based on optimal or near optimal demonstrations. These methods can also quantify the trade-off among task constraints, which can be difficult or even impossible for manual task modeling.
Given the advantage of AL, however, most previous techniques focused only on either perception or decision making without good integration between these two key components [9, 6]. It limits the capability of AL methods to address real-world problems when a robot needs to make decisions based upon online observations, especially in cases when the perception data consist of multiple modalities obtained from a variety of equipped sensors. To address this issue, several methods were proposed to integrate perception and planning within the same AL formulation. A promising direction is to utilize images perceived by robot’s onboard cameras as a representation of the current state, and then use supervised learning or reinforcement learning for decision making [10, 11]. However, state representation and recognition based on individual images often suffer from the issue of perceptual aliasing (i.e., multiple distinct states of the world give rise to the same percept), due to their incapability to incorporate temporal information or multimodal observations. Unreliable perception will result in wrong planning and decision making, and possibly fail the tasks.
In this paper, we develop a novel Sequence-based Multimodal Apprenticeship Learning (SMAL) method to integrate spatio-temporal multimodal perception and decision making in the AL scenario. Instead of using individual images, we propose to represent a world state directly as a sequence of multimodal observations. Then, state recognition is achieved by our new multimodal sequence-based scene matching that integrates multimodal features obtained from each individual frame and fuses temporal information contained in the whole sequence. Then, we introduce a framework to integrate the sequence-based multimodal state perception with a reinforcement learning method to achieve apprenticeship learning. We evaluate the proposed SMAL approach in challenging search and rescue applications, as we believe our new AL paradigm has potential to address several critical tasks such as victim search and path planning.
The main contributions of this paper are twofold. First, we propose a novel representation of world states, and introduce an approach to recognize the states by simultaneously fusing temporal information and multimodal data. Second, we develop the SMAL approach that integrates multisensory robot perception and decision making to learn tasks from human experts in challenging environments with perceptual aliasing (e.g., disaster scenarios).
The rest of this paper is organized as follows. We describe related publications in Section II. In Section III, we propose the sequence-based multimodal state learning. In Section IV, we discuss perception and decision making integration. After presenting experimental results in Section V, we conclude our paper in Section VI.
II Related Work
In this section, we provide a review of AL techniques, and state representation and recognition methods.
II-A Apprenticeship Learning
Apprenticeship learning [6], also known as learning from demonstration (LfD) or imitation learning (IL) has attracted numerous attention in recent decades [12, 13, 14], which allows robots to accomplish tasks autonomously by learning from expert demonstrations without being told explicitly.
Many AL methods were reported in various applications, which fall into two categories: Direct and indirect approaches [12]. Direct approaches directly imitate experts by applying supervised learning to learn policy as a direct mapping from states to motion primitives. In problems with discrete action space, classification methods are used as mapping functions [13, 15, 16, 17]. For example, interactive policy learning was proposed to control a car from demonstrations based on Gaussian mixture models [16]. AL techniques based on k-Nearest Neighbors (kNN) classifiers were implemented to learn obstacle avoidance and navigation [17]. In problems with continuous action space, regression-based methods are typically used as state-action mapping functions [10, 18, 19, 20]. For example, driving actions were learned through mapping input images to actions using neural networks [10]. Robot control policy was also estimated in soccer scenarios using sparse online gaussian processes [20].
Indirect approaches models the interaction between agent and environment as a Markovian decision problem, which select the optimal policy to maximize certain reward. Most methods manually defined the reward function. For example, hand-crafted sparse reward functions was applied for policy synthesis in the task of corridor following in the reinforcement learning framework. Reward functions depending on the swing angle were implemented in a ball-in-a-cup game [14], in which optimal actions were chosen to maximize the accumulated reward. Due to the great challenge to define an effective reward function [5]. Inverse reinforcement learning was proposed to learn optimal reward functions given expert demonstrations [21, 22, 23]. For example, three methods were demonstrated in grid world and mountain-car tasks [9]. An inverse reinforcement learning method was proposed to recover unknown reward functions under MDP framework, which was able to output policy with performance close to that of the expert [6].
However, most previous studies assume the state space is known a priori, which still require at least partially manual construction of state space. To address this issue, we propose a state learning method to automatically construct state space from multimodal sequential observations provided in expert demonstrations.
II-B State Representation and Recognition
As our objective is to integrate decision making and robot perception that applies onboard sensors to perceive the world state, this review will focus on methods that represent world states based on raw data directly acquired by optical cameras, which have become a standard sensor in modern robots.
Representation: Many techniques have been implemented to characterize and represent world states from image data based on features. Local and global features are two main categories for visual state representation [24]. Local features describe local information in a part of an image, including SIFT [25], ORB [26], etc. Such techniques apply a detector to identify interest points in an image and extract a feature vector by applying a descriptor around each interest point. Unlike local features, state representations based on global features describe the whole image, which encode its global color, shape, and texture signatures [27]. Examples of global features include LDB to encode intensity and gradient differences of image grid cells, GIST [28] to encode dominant spatial structures, and the recent deep feature to learn image statistics [29].
Recognition: Most of the previous state recognition methods (e.g., scene recognition) are based on individual-image matching, using pairwise similarity scoring [30, 31], nearest neighbor search [32, 33, 34], and sparse optimization [35, 36]. However, it has been demonstrated that state (or scene) recognition based on individual images cannot work well in challenging environments (e.g., with strong perceptual aliasing) [37, 38, 30, 31] and fusing information from a sequence of images is critical to match between states [30].
Different from previous techniques, we propose a unified formulation to simultaneously fuse multiple types of features to represent states and match sequence of multimodal observations for state recognition.
III Sparse Multimodal State Learning
We propose a novel SMAL approach to (1) represent and recognize states based on multimodal observation sequences, and (2) integrate state learning with decision making to guide robot actions (e.g., performing victim search and rescue in disaster areas). This section focuses on contribution (1), and contribution (2) will be detailed in the Section IV.
Notation. In this paper, we represent vectors as boldface lowercase letters, and matrices using boldface, capital letters. Given a matrix , we refer to its -th row and -th column as and , respectively. The -norm of a vector is defined as , and the -norm of is defined as . The -norm of the matrix is defined as:
[TABLE]
III-A Sequence-based Multimodal State Matching
To solve the problem of state identification in challenging real-world environments (e.g., disaster scenarios in search and rescue operations), we propose to incorporate a temporal sequence of observations (e.g., images) for state recognition and fuse multiple heterogenous sensing modalities to capture comprehensive environmental information to address perceptual aliasing.
Assume a set of templates encoding the states (e.g., scenes in victim search) from a target area , and each template contains a set of heterogenous feature modalities , where represents the feature vector of length extracted from the -the feature modality and . Because our method focuses on sequence-based state learning, we group adjacent observations (e.g., camera frames) together as a temporal sequence to encode each state, resulting in the set of sequence-based templates , where denotes the -th sequence that contains images acquired in a short time period, and is the number of sequences in the set satisfying . Then, given a new query sequence containing a set of multimodal observations , we formulate state identification as a learning task to estimate the weight matrix, :
[TABLE]
where denotes the weights of the templates in the -th sequence with respect to the -th query observation in the sequence .
Since individual observations in template and query sequences can be noisy or contain missing values, we propose to constrain each observation in the sequence to only rely on a small number of representative template sequences for state recognition, leading to the regularized sparse optimization problem as follows:
[TABLE]
where the -norm regularization of forces the sparsity of the scene templates used to represent the query scene. Eq. (3) can be rewritten as a more compact matrix expression:
[TABLE]
where .
However, the regularizer of weight matrix in Eq. (4) is an element-wise -norm, which ignores the interrelationship among individual feature modalities within each observation . To encode this interrelationship among individual modalities within an individual observation , we use the -norm as a new regularization:
[TABLE]
The -norm regularization applies an -norm to enforce group effects of all individual modalities in the same individual observation, and uses an -norm to enforce the sparsity among individual observations.
To enable sequence-based state recognition, we propose a new regularization to model the group structure among all sequences. We name it the -norm, because it is a structured -norm encoding the group structure of , as follows:
[TABLE]
The -norm applies the -norm on individual observations within each sequence, and the -norm among sequences. That is, the new -norm not only enforces the observations within the same sequence to have similar weights, but also enforces the sparsity between sequences. For example, if a template sequence is not representative for a query observation , the weights of the individual observations in have small values; otherwise, their weights are large.
Thus, the final optimization problem becomes
[TABLE]
III-B State Space Learning and State Identification
Our previous discussion is based upon the assumption that the state space has been provided during the training phase using expert demonstrations. However, the critical problems of how to construct state space has not been discussed.
To address this problem in the training phase, we introduce a new approach in Algorithm 1 to automatically construct the state space for our sequence-based state recognition. Intuitively, if a query sequence does not match any template sequences within the database, it will be inserted into the database. Formally, after obtaining the optimal weight matrix , given a new sequence during training, we identify its state by matching with all existing template sequences . If the weight of a template sequence satisfies:
[TABLE]
where is a threshold with a small value, then we conclude that does not match the sequence . If does not have any matches in , we add into the template database. This approach ensures that there exists only one representative sequence in the template database to encode the same state (e.g., the same scene with similar viewpoints). If duplicated sequences are provided, our algorithm will ignore them, and the state space will remain the same.
During the execution phase, given the query sequence of multimodal observations obtained by the robot, our SMAL method recognizes its state by solving the following problem:
[TABLE]
where is computed by Algorithm 2.
III-C Optimization Algorithm
Although the optimization problem in Eq. (7) is convex, it is challenging to solve it since there are non-smooth terms in the objective function. Here we provide an efficient algorithm to solve this problem that grantees theoretical convergence.
After taking the derivative of Eq. (7) with respect to and setting it to [math], we have
[TABLE]
where is a diagonal matrix with the -th diagonal element equals , is a diagonal matrix with the -th element as , and is a block diagonal matrix with the -th diagonal block as , where denotes an dimensional identity matrix. For each , we obtain
[TABLE]
Therefore, can be calculated by
[TABLE]
We can observe that the matrices and in Eq. (12) all depend on the weight matrix , which are unknown. To solve this regularized optimization problem, we propose an iterative solver as presented in Algorithm 2. We can prove that Algorithm 2 guarantees the theoretical convergence to the global optimal solution. Detailed analysis and mathematical proof is provided in Appendix.
IV Integration of State Perception and
Decision Making
Beyond the ability to automatically learn states, our SMAL method is also able to integrate state perception and decision making. This integration allows a robot to directly utilize raw multisensory observation sequences to make decisions and take actions, without assuming perfect perception or hand-crafted states that are not practical in complicated real-world environments (e.g., search and rescue scenarios).
We propose to achieve the integration of our state perception with the general Markov decision process (MDP) model, which has been widely employed for robot decision making, to show the generalization of our SMAL method that has the potential to impact various robotics applications using MDP. From the viewpoint of real-world online robot execution, the input data into our integrated model is the raw multimodal observation sequences obtained by sensors equipped on the robot, and the output of our SMAL method is an optimal action learned in response to the state identified by our perception method.
Formally, the integrated perception and decision making model of our SMAL method is represented as a tuple , where denotes a finite set of discrete states; represents a finite set of discrete actions that human/robot can perform to activate state transitions; denotes a discrete transition function representing the probability of a state transition resulted from an action; denotes a mapping from the state-action pair to a scalar, representing the immediate reward received when the robot takes action in state ; and is a reward discount factor. Different from previous MDP-based methods whose states are typically computed at a specific time point and represented by a single modality, our integrated model represents a state based on a sequence of observations with multimodal modalities. This integration is realized using our sequence-based multimodal state recognition method that transfers a multimodal observation sequence into a discrete value , as defined in Eq. (9).
Same as all MDP-based decision making, our integrated model aims to learn a policy that is defined as a mapping from the learned state space to the action space . The value of a policy is given by . Then, the objective of decision making is to find an optimal policy to maximize the value function :
[TABLE]
In the following, we describe our implemented methods to learn other components of the MDP model used in integrated SMAL method, as follows:
Learning Action Space. The action space can be learned based on the kinematic data collected during expert demonstrations. In our experiments, teleoperation command streams provided by humans are recorded and used to learn the action space , where each action consists of a sequence of atom movements. Such atom movements include moving forward, moving backward, turning left, and turning right. Ideally, actions are continuous, but robots perform actions in discrete-time during the execution phase, since the specific optimal action is selected based on the current state, which is discrete and recognized at every frames. The action space is learned by Algorithm 3.
Learning State Transition. The state transition represents the probability that the system will end up in state after taking action in state . The state transition is learned using the state and action streams obtained in Algorithms 1 and 3, respectively. In our implementation, the state transition is learned by Algorithm 4.
Learning Immediate Reward. After the MDP model is learned, we are able to learn the immediate reward provided by the human demonstrations and a predefined . A widely used technique is inverse reinforcement learning. We directly employed the technique in [9], in which reward learning is formulated as a sparse optimization problem since the maximum reward (i.e., finding victims) in our application is achieved at the end state.
V Experiments
To evaluate the performance of our SMAL approach, we performed two sets of experiments in different scenarios to address the application of robot-assisted search and rescue, including (1) urban search and rescue in simulation, and (2) indoor search tasks using real robots. The mission objective for the robot is to find a victim within the environment, who are not directly viewable by the robot.
In our experiments, the (simulated and real) robots employ a camera to perceive the surrounding world; multiple feature modalities are applied to extract information to represent the world. To enable real-time performance, we intentionally use feature modalities that can be extracted efficiently, including low-resolution color features on downsampled images and histogram of oriented gradients features on downsampled images. The visual feature modalities are normalized and concatenated as a multimodal representation of individual observations.
V-A Urban Search and Rescue Simulation
In this set of experiments, we apply the Webots simulator111Webots: https://www.cyberbotics.com. [39] to evaluate our SMAL approach in an urban search and rescue application. The objective is to let a robot learn how to find victims in large urban areas from expert demonstrations. We chose the campus of the Colorado School of Mines as our urban environments. The Google satellite map of this area is shown in Fig. 2(b). We imported the OpenStreetMap222OpenStreetMap: http://www.openstreetmap.org/#map=18/39.74966/-105.22212. of this area into the Webots platform, as illustrated in Fig. 2(a). The robot and victim models we built in the Webots platform are shown in Fig. 2(a). The two-wheel mobile robot, named Rescuebot, equips with a color camera with a resolution. In addition, we are able to obtain the accurate Rescuebot’s location and rotation information from the simulator, which is used as the ground truth to evaluate state recognition. The victim is lying on the ground without any movement during the entire simulation period, waiting for a robot to find him.
During the training process, we teleoperated the Rescuebot to approach the target victim using keyboards as the expert demonstration. The image sequences obtained by the Rescuebot and the keyboard teleoperation commands were recorded to train our SMAL method. After training was completed, the Rescuebot was able to automatically execute search operations using the learned model in the testing phase.
To qualitatively evaluate the experimental results, an example route that the Rescuebot successfully finds the victim in the execution phase is presented in Fig. 2(c). It demonstrates that, although the Rescuebot cannot see the victim directly, the robot is still able to move and search around to locate the victim. This qualitative result demonstrates that our SMAL method enables robots to learn how to autonomously search for victims in urban search and rescue scenarios.
In addition, we perform quantitative validation using the precision-recall curve as a metric to evaluate the performance of state recognition, as shown in Fig. 3(a) (curves closer to the top right corner indicating a better performance). We also compared the SMAL approach to the baseline method based on individual images with the same modalities, which is demonstrated in Fig. 3(a). It is observed that our SMAL method for sequence-based state recognition outperforms the baseline method using individual images.
We also evaluate the efficiency of our methods for state recognition through studying the value of objective function iteratively updated by Algorithm 2. The result, presented in Fig. 3(b), indicates the algorithm converges in 9 iterations (in general, it converges within 20 iterations with the value below , which demonstrates the algorithm efficiency to solve the formulated regularized optimization problem.
V-B Indoor Search and Rescue using Real TurtleBot
In this set of experiments, we evaluate our SMAL method to teach robots to perform victim search in indoor scenarios. A real TurtleBot II robot is used to evaluate the performance of our system. The objective is to teach the TurtleBot about how to find victims (in this experiment, a NAO humanoid robot) in the room using expert demonstrations. The setup of the indoor search area is presented in Fig. 4(a). We also install an overhead camera above this area to collect the ground truth of robot location and orientation for evaluation only by tracking the ARTag attached on top of the Turtlebot.
In the training phase, we teleoperated the TurtleBot using keyboards as demonstrations to let it approach the Nao robot. The observation obtained by the TurtleBot and the keyboard teleoperation commands were recorded to train our SMAL model. After that, during the execution phase, the TurtleBot executed the search task based on the learned model to find the NAO robot. A challenge of this real-world experiment in comparison to simulation is that the TurtleBot often shook when moving, making the captured observations unstable, which can decrease the accuracy of state recognition.
The qualitative experimental results are illustrated in Fig. 4(b), which indicates even the Turtlebot cannot directly see the victim (i.e., the NAO robot in this set of experiments), but it can still navigate around multiple obstacles to find the victim. This demonstrates the effectiveness of our SMAL approach to teach robots about how to search victims in a real indoor environment. We also quantitatively evaluate our method’s performance using precession-recall curves and compare SMAL with the baseline method using image matching. The results are presented in Fig. 5(a), which shows our approach significantly outperforms the baseline method. The efficiency of our SMAL approach is proved in Fig. 5(b), which shows the algorithm converges after 12 iterations.
V-C Parameter Analysis
We analyze the effects of various parameter values on our SMAL approach in real-world indoor search tasks using real TurtleBots.
The sequence length for state recognition is the most important parameter. The precision-recall curves in Fig. 6(a) indicate that better performance can be obtained when we increase the sequence length. That is because long sequences can provide more comprehensive information than short sequences. When , a sequence becomes a single image. In addition, we use success rate as a metric to evaluate the percentage that the robot can successfully find victims without hitting obstacles. The results are demonstrated in Fig. 6(b), where 10 executions are used in each case to calculate the success rate. It is observed that when the used sequence is short, the poor perception result negatively affects decision making, resulting in the low success rate. However, longer sequences do not necessarily result in higher success rates. This is because as we increase the sequence length, although each sequence can contain more information, the frequency in which the robot receives observations decreases. This can dramatically decrease the succuss rate, since the information does not come in time for robot control.
VI Conclusion
We propose a novel sequence-based multimodal apprenticeship learning approach that can automatically learn and identify world states, and integrates perception and decision making. The SMAL approach represents each state as a sequence of multimodal observations by simultaneously fusing temporal information and multimodal data. The SMAL approach also integrates robot perception and decision making to learn tasks from human demonstrations to enable effective robot actions in challenging environments with perceptual aliasing. To evaluate the performance of the SMAL method, experiments using both simulations and real-world robots are performed in the challenging search and rescue applications. Qualitative results have validated that our method is able to guide autonomous robots to successfully finish the search and rescue task. In addition, quantitative evaluation results have demonstrated that our SMAL method outperforms baseline methods based on individual images to find victims in the challenging search and rescue applications.
Appendix A Convergence Analysis of Algorithm 2
Theorem 1
Algorithm 2 decreases the objective value of the problem in Eq. (7) in each iteration.
The following lemma [40] is used to prove Theorem 1.
Lemma 1
For any nonzero vector and , the following inequality holds: .
Then we are ready to prove the convergence of Algorithm 2, which is represented by Theorem 1.
Proof:
We denote the update of is . According to Step 6 in Algorithm 2, we have:
[TABLE]
Thus, we can obtain
[TABLE]
We are able to derive the following inequalities according to the definition of , , and :
[TABLE]
According to Lemma 1, we obtain the inequalities:
[TABLE]
[TABLE]
[TABLE]
After computing the summation of the three equations in Eq. (A) on both sides (weighted by s), we obtain:
[TABLE]
Thus, we conclude that Algorithm 2 decreases the objective value monotonically during each iteration. Because Eq. (7) is a convex optimization function, Algorithm 2 converges to the global optimal solution. ∎
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] J. D. Sweeney and R. Grupen, “A model of shared grasp affordances from demonstration,” in Humanoid , 2007.
- 2[2] J. Chen and A. Zelinsky, “Programing by demonstration: Coping with suboptimal teaching actions,” IJRR , vol. 22, no. 5, pp. 299–319, 2003.
- 3[3] P. Abbeel, A. Coates, and A. Y. Ng, “Autonomous helicopter aerobatics through apprenticeship learning,” IJRR , 2010.
- 4[4] W. D. Smart, “Making reinforcement learning work on real robots,” Ph.D. dissertation, Brown University, 2002.
- 5[5] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, “A survey of robot learning from demonstration,” RAS , vol. 57, pp. 469–483, 2009.
- 6[6] P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse reinforcement learning,” in ICML , 2004.
- 7[7] J. A. Bagnell and J. G. Schneider, “Autonomous helicopter control using reinforcement learning policy search methods,” in ICRA , 2001.
- 8[8] R. Amit and M. Matari, “Learning movement sequences from demonstration,” in ICDL , 2002.
