TL;DR
This paper introduces a self-supervised learning approach to create compact multimodal sensory representations that enhance the efficiency and robustness of contact-rich manipulation policies in unstructured environments, demonstrated on peg insertion tasks.
Contribution
It proposes a novel self-supervised multimodal representation learning method that improves sample efficiency and generalization in robot manipulation tasks involving vision and touch.
Findings
The method generalizes across different geometries and configurations.
It is robust to external perturbations.
It is effective in both simulation and real-world experiments.
| Loss | Full Model | Deterministic | No Pairing | d-dim 256 | d-dim 64 | d-dim 16 |
|---|---|---|---|---|---|---|
| Optical Flow | 1.63 | 1.79 | 1.26 | 1.27 | 1.27 | 1.81 |
| End-Effector Pose | 0.75 | 0.91 | 0.56 | 0.32 | 0.31 | 1.96 |
| Contact | 13.84 | 312.11 | 0.68 | 0.87 | 0.87 | 2.51 |
| Pairing | 4.99 | 2,514.26 | N/A | 26.00 | 11.31 | 32.29 |
| All Models | |
|---|---|
| Batch size | 64 |
| Learning Rate | 1.00E-04 |
| Adam Beta1 ($\beta_1$) | 0.5 |
| Adam Beta2 ($\beta_2$) | 0.999 |
| Metric | Dataset | Full Model | Deterministic | No Pairing | d-dim 256 | d-dim 64 | d-dim 16 |
|---|---|---|---|---|---|---|---|
| Optical Flow Loss | Train | 0.018 | 0.016 | 0.020 | 0.019 | 0.019 | 0.017 |
| | Test | 0.029 | 0.028 | 0.025 | 0.024 | 0.024 | 0.031 |
| End-Effector Pose Loss | Train | 8.10E-03 | 1.76E-04 | 1.31E-02 | 1.66E-02 | 1.66E-02 | 1.56E-02 |
| | Test | 6.07E-03 | 1.60E-04 | 7.25E-03 | 5.34E-03 | 5.12E-03 | 3.05E-02 |
| Contact Loss | Train | 0.033 | 0.002 | 0.095 | 0.099 | 0.099 | 0.032 |
| | Test | 0.459 | 0.657 | 0.065 | 0.086 | 0.086 | 0.080 |
| Contact Accuracy | Train | 98.4% | 100.0% | 98.4% | 98.4% | 96.9% | 95.3% |
| | Test | 98.9% | 99.9% | 99.2% | 99.5% | 97.8% | 98.6% |
| Pairing Loss | Train | 0.221 | 3.72E-04 | N/A | 0.074 | 0.196 | 0.061 |
| | Test | 1.102 | 0.934 | N/A | 1.916 | 2.221 | 1.974 |
| Pairing Accuracy | Train | 94.5% | 100.0% | N/A | 82.0% | 96.9% | 98.4% |
| | Test | 90.6% | 96.0% | N/A | 58.8% | 88.2% | 87.7% |
| | Simulation | Real Robot |
|---|---|---|
| Episode Length | 500 | 1000 |
| Batch size | 2000 | 3000 |
| GAE Lambda ($\lambda$) | 0.97 | 0.97, 0.98* |
| GAE Gamma ($\gamma$) | 0.995 | 0.995, 0.99* |
| Max KL | 1E-02 | 1E-02 |
| Damping Coefficient | 1E-01 | 1E-01 |
Making Sense of Vision and Touch: Learning Multimodal Representations for Contact-Rich Tasks
Michelle A. Lee, Yuke Zhu, Peter Zachares, Matthew Tan, Krishnan Srinivasan, Silvio Savarese, Li Fei-Fei, Animesh Garg, Jeannette Bohg
Authors are with the Department of Computer Science, Stanford University. [mishlee,yukez,zachares,mratan,krshna,ssilvio,feifeili,animeshg,bohg]@stanford.edu. A. Garg is also at Nvidia, USA.
Abstract
Contact-rich manipulation tasks in unstructured environments often require both haptic and visual feedback. It is non-trivial to manually design a robot controller that combines these modalities which have very different characteristics. While deep reinforcement learning has shown success in learning control policies for high-dimensional inputs, these algorithms are generally intractable to deploy on real robots due to sample complexity. In this work, we use self-supervision to learn a compact and multimodal representation of our sensory inputs, which can then be used to improve the sample efficiency of our policy learning. Evaluating our method on a peg insertion task, we show that it generalizes over varying geometries, configurations, and clearances, while being robust to external perturbations. We also systematically study different self-supervised learning objectives and representation learning architectures. Results are presented in simulation and on a physical robot.
Index Terms:
Deep Learning in Robotics and Automation, Perception for Grasping and Manipulation, Sensor Fusion, Sensor-based Control
I Introduction
Even in routine tasks such as inserting a car key into the ignition, humans seamlessly combine the senses of vision and touch to complete the task. Visual feedback provides semantic and geometric object properties for accurate reaching or grasp pre-shaping. Haptic feedback provides observations of current contact conditions between object and environment for accurate localization and control under occlusions. These two types of feedback modalities are complementary and concurrent during contact-rich manipulation [8], which is illustrated in Fig. 1. Yet few algorithms endow robots with a similar capability. While the utility of multimodal data has frequently been shown in robotics [7, 54, 66, 58, 20], the proposed manipulation strategies often rely on handcrafted features or prior knowledge about how to solve a task. This makes many of these methods task-specific. On the other hand, most learning-based methods do not require manual task specification, yet the majority of learned manipulation policies close the control loop around a single modality, often RGB images [41, 15, 22, 72].
In this work, we equip a robot with a policy that leverages multimodal feedback from vision and touch – modalities that differ in many characteristics such as dimensionality, sampling frequency, and value range. Our proposed policy is learned through self-supervision and generalizes over variations of the same contact-rich manipulation task in geometries, configurations, and clearances. As a case study, we use the task of peg insertion. We qualitatively show that the learned policy is also robust to some external perturbations. Our approach starts with using neural networks to learn a joint representation of haptic, RGB-D as well as proprioceptive information. Using self-supervised learning objectives, this network is trained to predict optical flow, whether contact will be made in the next control cycle, future end-effector position, and concurrency of visual and haptic data. The training is action-conditional, encouraging the encoding of action-related information. The resulting compact representation of the high-dimensional and heterogeneous data forms the input to a policy for contact-rich manipulation tasks (in this paper peg insertion) that is learned using deep reinforcement learning. The proposed decoupling of state estimation and control achieves practical sample efficiency for learning both representation and policy on a real robot.
Our primary contributions are:
1. A variational model for multimodal representation learning from which a contact-rich manipulation policy can be learned.
2. Demonstration of peg insertion tasks that effectively utilize both haptic and visual feedback for hole search, peg alignment, and insertion (see Fig. 1). Ablation studies comparing the effects of each modality on task performance.
3. Evaluation of generalization to tasks with different peg geometry and of robustness to perturbation and sensor noise.
This work is an extended version of a previously published conference paper [39]. We propose a new variational representation learning technique and significantly expand the experimental evaluation of the overall methodology in the following ways:
4. Analysis of our multimodal representation model compared to baseline models with different learning objectives, architecture, and dimension of the representation.
5. Addition of depth as an input modality, and addition of end-effector roll to the action space, which makes the peg insertion task more challenging and increases the dimensionality of the action space from 3-DoF to 4-DoF.
6. Reproduction of results on a new robot, the Franka Panda.
II Related Work and Background
II-A Contact-Rich Manipulation
Contact-rich tasks, such as peg insertion, fastening screws, and edge following, have been studied for decades due to their relevance in manufacturing. Manipulation policies often rely entirely on haptic feedback and force control, and assume sufficiently accurate state estimation [68]. They typically generalize over certain task variations, for instance, peg-in-chamfered-hole insertion policies that work independently of peg diameter [67]. However, entirely new policies are required for new geometries. For chamferless holes, manually defining a small set of viable contact configurations has been successful [12] but cannot accommodate the vast range of real-world variations. [58] combine visual and haptic data for inserting two planar pegs with more complex cross sections, but assume known peg geometry.
Reinforcement learning approaches have recently been proposed to address variations in geometry and configuration for manipulation. [41, 72] train neural network policies using RGB images and proprioceptive feedback. Their approach works well in a wide range of tasks, but the large object clearances compared to manufacturing tasks may explain the sufficiency of RGB data. A series of learning-based approaches have relied on haptic feedback for manipulation. Many of them are concerned with estimating the stability of a grasp before lifting an object [14, 6], even suggesting a regrasp [60]. Only a few approaches learn entire manipulation policies through reinforcement only given haptic feedback [30, 61, 29, 65, 62, 63]. While [30] relies on raw force-torque feedback, [61, 29, 62] learn a low-dimensional representation of high-dimensional tactile data before learning a policy, and [63] learns a dynamics model of the tactile feedback in a latent space.
Even fewer approaches exploit the complementary nature of vision and touch. Some of them extend their previous work on grasp stability estimation [5, 13]. Others perform full manipulation tasks based on multiple input modalities [31, 20, 1] but require a pre-specified manipulation graph [31], demonstrate only on one task [31, 20], or require human demonstration and object CAD models [1]. There have been promising works that train manipulation policies in simulation and transfer them to a real robot [3, 50, 10]. However, only a few works have focused on contact-rich tasks [24], and none have relied on haptic feedback in simulation, most likely because of the lack of fidelity of contact simulation and collision modeling for articulated rigid-body systems [25, 21].
II-B Representation Learning for Policy Learning
The aim of representation learning is to discover a low-dimensional feature representation of high-dimensional data that captures the information that is relevant for a particular task. In the context of reinforcement learning (discussed in more detail in Sec. III), a good representation encodes the essential information of the state for the agent to choose its next action for a given task [40]. A compact and low-dimensional state representation can make reinforcement learning more data efficient.
A popular representation learning objective is reconstruction of the raw sensory input through variational autoencoders [11, 40, 29, 70], which we consider as a baseline in this work. This unsupervised objective benefits learning stability and speed, but it is also data intensive and prone to overfitting [11]. When learning for control, action-conditional predictive representations can encourage the state representations to capture action-relevant information [40]. There are studies that attempt to predict full images when pushing objects with benign success [2, 4, 46]. In these cases either the underlying dynamics is deterministic [46], or the control runs at a low frequency [22]. In contrast, we operate with haptic feedback at 1kHz and send Cartesian control commands at 20Hz. We use an action-conditional surrogate objective for predicting optical flow, end-effector poses, and contact events with self-supervision.
For a detailed survey of different loss functions, model architectures and training methods for representation learning, we refer to [11, 40].
II-C Multimodal Learning
The complementary nature of heterogeneous sensor modalities has previously been explored for inference and decision making. In this section, we review works that include such diverse modalities as vision, range, audio, haptic and proprioceptive data as well as language. This heterogeneous data makes the application of hand-designed features and sensor fusion extremely challenging. That is why learning-based methods have been at the forefront. For example, there have been many works that have explored the correlation between auditory and visual data for tasks such as speech or material recognition or for sound source localization [45, 48, 49, 70]. [13, 26, 5, 57] fuse visual and haptic data for grasp stability assessment, manipulation, material recognition, or object categorization. [43, 61] fuse vision and range sensing and [61] add language labels. While many of these multimodal approaches are trained through a classification objective [13, 26, 5, 70], in this paper we are interested in multimodal representation learning for control.
There is compelling evidence that the interdependence and concurrency of different sensory streams aid perception and manipulation [18, 36, 9]. Several works have combined visual and tactile information for state estimation with probabilistic inference models such as recursive Bayesian filters [44] and factor graphs [37, 71], but these methods require pre-defined visual features from visual motion trackers or patterned objects, as well as some prior knowledge of the manipulated objects, such as geometric constraints.
In contrast, few studies have explicitly exploited this concurrency in representation learning. Examples include [59] for visual prediction tasks and [45, 48] for audio-visual coupling. In this paper, we follow [69], who demonstrated the advantages of combining multiple, concurrent modalities into a latent space by a product-of-experts approach. Through this approach, fewer parameters and less data are needed for multimodal image analysis and language translation tasks. Similar to [48], we also adopt a self-supervised objective to fuse visual and haptic data by predicting whether visual and haptic data are temporally aligned.
III Problem Statement and Method Overview
Our goal is to learn a policy on a robot for performing contact-rich manipulation tasks. We want to evaluate the value of combining multisensory information and the ability to transfer multimodal representations across tasks. We systematically study different representation learning techniques and losses.
For sample efficiency, we first learn a neural network-based feature representation of the multisensory data. The resulting compact feature vector serves as input to a neural network policy trained through deep reinforcement learning.
We model the manipulation task as a finite-horizon, discounted Markov Decision Process (MDP) $\mathcal{M}$, with a state space $\mathcal{S}$, an action space $\mathcal{A}$, state transition dynamics $\mathcal{T}: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$, an initial state distribution $\rho_0$, a reward function $r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$, horizon $T$, and discount factor $\gamma \in (0, 1]$. To determine the optimal stochastic policy $\pi: \mathcal{S} \rightarrow \mathbb{P}(\mathcal{A})$, we want to maximize the expected discounted reward
$$J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{T-1} \gamma^{t}\, r(\mathbf{s}_t, \mathbf{a}_t)\right]. \tag{1}$$
We parameterize the policy with a neural network whose parameters $\theta_\pi$ are learned as described in Sec. VI. $\mathcal{S}$ is defined by the low-dimensional latent space representation learned from high-dimensional 2D and 3D visual data and from haptic data. This representation is a neural network parameterized by $\phi$ and $\theta$ and is trained as described in Sec. IV. We refer to this learned latent representation as $z$ in the rest of the paper. $\mathcal{A}$ is defined over continuously-valued 3D position displacements and roll angle displacement in end-effector space. The controller design is detailed in Sec. VI.
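To make the expected discounted reward objective concrete, the following minimal sketch estimates it from sampled rollouts. The function names and the list-of-rewards episode format are illustrative assumptions, not part of the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * r_t for one finite-horizon episode."""
    total = 0.0
    # Accumulate backwards so each step needs one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def estimate_objective(episodes, gamma):
    """Monte Carlo estimate of the objective: mean discounted return over rollouts."""
    returns = [discounted_return(ep, gamma) for ep in episodes]
    return sum(returns) / len(returns)
```

Each entry of `episodes` is the reward sequence of one rollout of at most the horizon length; averaging over more rollouts reduces the variance of the estimate.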
IV Multi-Modal Representation Model
Deep networks are powerful tools to learn representations from high-dimensional data [38] but require a substantial amount of training data. Here, we address the challenge of seeking sources of supervision that do not rely on laborious human annotation. We design a set of predictive tasks that are suitable for learning a fused representation of visual and haptic data for contact-rich manipulation tasks, where supervision can be obtained via automatic procedures rather than manual labeling. We extend our previous work [39] by using a variational model for representation learning instead of a deterministic model. We show how this yields significantly better manipulation policies. Figure 2 visualizes the proposed representation learning model, which uses neural network encoders to learn features from raw sensory inputs and neural network decoders to predict our self-supervised objectives.
IV-A Variational Inference for Representation Learning
We view representation learning from the perspective of a probabilistic graphical model, where we aim to learn $p(z \mid D)$, the posterior distribution of the latent variable $z$ given the dataset $D$, which consists of input sensor readings, self-supervised labels, and robot actions.
Unfortunately, computing the true posterior $p(z \mid D)$ is intractable, as it would require marginalizing over all possible $z$ to compute the evidence $p(D)$. Instead, we use Variational Bayes [34] to find an approximate posterior $q_\phi(z \mid D)$ that is as close as possible to the true posterior $p(z \mid D)$. We can measure how closely $q_\phi(z \mid D)$ approximates $p(z \mid D)$ with the Kullback-Leibler (KL) divergence $\mathrm{KL}\left[q_\phi(z \mid D) \,\|\, p(z \mid D)\right]$. Evaluating this divergence still requires computing the intractable marginal likelihood $p(D)$. We can rewrite the log evidence as:
$$\log p(D) = \mathrm{KL}\left[q_\phi(z \mid D) \,\|\, p(z \mid D)\right] + \mathbb{E}_{q_\phi(z \mid D)}\left[\log p(z, D) - \log q_\phi(z \mid D)\right], \tag{2}$$
where the KL divergence term is always non-negative by Jensen’s inequality. Therefore, the second term forms the Evidence Lower BOund (ELBO), and maximizing this bound minimizes the KL divergence term.
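The decomposition of the log evidence into a KL term and the ELBO can be checked in a few lines. The derivation below is a sketch in standard variational notation, where $q_\phi(z \mid D)$ denotes the approximate posterior; the symbols are assumed here rather than taken from the source.

```latex
\begin{align*}
\log p(D)
  &= \mathbb{E}_{q_\phi(z \mid D)}\left[\log p(D)\right]
     && \text{($\log p(D)$ does not depend on $z$)} \\
  &= \mathbb{E}_{q_\phi(z \mid D)}\left[\log \frac{p(z, D)}{p(z \mid D)}\right]
     && \text{(definition of conditional probability)} \\
  &= \mathbb{E}_{q_\phi(z \mid D)}\left[\log \frac{q_\phi(z \mid D)}{p(z \mid D)}\right]
   + \mathbb{E}_{q_\phi(z \mid D)}\left[\log \frac{p(z, D)}{q_\phi(z \mid D)}\right] \\
  &= \mathrm{KL}\left[q_\phi(z \mid D) \,\|\, p(z \mid D)\right]
   + \underbrace{\mathbb{E}_{q_\phi(z \mid D)}\left[\log p(z, D) - \log q_\phi(z \mid D)\right]}_{\text{ELBO}}
\end{align*}
```

Because the left-hand side is fixed for a given dataset, any increase in the ELBO must be matched by an equal decrease in the KL term.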
We assume that each data point $D_i$ maps to a unique $z_i$ such that the ELBO is a sum over a term per data point, $\mathcal{L}_i$, defined as:
$$\mathcal{L}_i(\theta, \phi) = \mathbb{E}_{q_\phi(z \mid D_i)}\left[\log p_\theta(D_i \mid z)\right] \tag{3}$$
$$\qquad\qquad - \mathrm{KL}\left[q_\phi(z \mid D_i) \,\|\, p(z)\right]. \tag{4}$$
The first term (Eqn. 3) refers to the expected log-likelihood of the $i$-th data point given the latent variable $z$. We model the likelihood $p_\theta(D_i \mid z)$ with a decoder neural network, parameterized by $\theta$. The second term (Eqn. 4) can be seen as a regularization term, where we fit the posterior estimate to a prior by minimizing the KL-divergence between the two distributions. We model $q_\phi(z \mid D_i)$ as a neural network encoder parameterized by $\phi$. We assume that the prior $p(z)$ has a standard isotropic multivariate Gaussian distribution $\mathcal{N}(0, I)$. This prior has the effect that posterior estimates which diverge from a standard normal distribution incur a large penalty. To learn the parameters of the encoder and decoder networks, we minimize the negative ELBO with respect to $\theta$ and $\phi$. This maximizes a lower bound on Eqn. 2, the log-likelihood of the data, with respect to the network parameters.
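For a diagonal Gaussian posterior and a standard normal prior, the KL regularization term has a well-known closed form. The sketch below computes it and adds it to a given negative log-likelihood; the function names and the plain-list representation of the mean and variance are assumptions made for illustration.

```python
import math


def kl_to_standard_normal(mu, var):
    """Closed-form KL[N(mu, diag(var)) || N(0, I)], summed over latent dimensions."""
    return 0.5 * sum(v + m * m - 1.0 - math.log(v) for m, v in zip(mu, var))


def negative_elbo(neg_log_likelihood, mu, var):
    """Negative ELBO for one data point: prediction loss plus KL regularizer."""
    return neg_log_likelihood + kl_to_standard_normal(mu, var)
```

The KL term is zero exactly when the posterior equals the prior and grows as the mean moves away from zero or the variance away from one, which is the penalty described above.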
Our variational inference framework is similar to the widely used Variational Autoencoders (VAE) [34] with one key difference: our decoders do not reconstruct the input data. Instead, the decoders optimize self-supervised learning objectives such as making action-conditional predictions over time. Details of our predictive tasks and loss functions can be found in Sec. IV-D. In Sec. VII, we will show that these predictive objectives lead to better performance than reconstruction as a representation learning objective.
IV-B Modality Encoders
Our model encodes four types of sensory data available to the robot: RGB and depth images from a fixed RGB-D camera, haptic feedback from a wrist-mounted force-torque (F/T) sensor, and proprioceptive data from the joint encoders of the robot arm. The heterogeneous nature of this data requires domain-specific encoders to capture the unique characteristics of each modality, which we will fuse into a single latent representation vector of dimension $d$. As we are using a variational inference approach that fits the encoder’s posterior distribution to an isotropic multivariate Gaussian prior, each encoder outputs a mean and a variance vector, each of dimension $d$.
For visual feedback, we use a six-layer convolutional neural network (CNN) similar to FlowNet [23] to encode RGB images. For depth feedback, we use an eighteen-layer CNN with convolutional filters of increasing depths similar to VGG-16 [56] to encode depth images. We add a single fully-connected layer to the end of both the depth and RGB encoders to transform the final activation maps into a $2 \times d$-dimensional variational parameter vector. For haptic feedback, we take the last 32 readings from the six-axis F/T sensor as a time series and perform 5-layer causal convolutions [47] with stride 2 to transform the force readings into a $2 \times d$-dimensional variational parameter vector. For proprioception, we encode the current position, roll, linear velocity and roll angular velocity of the end-effector with a 4-layer multilayer perceptron (MLP) to produce a $2 \times d$-dimensional variational parameter vector. In the next section, we discuss how the resulting four vectors that represent each modality are fused into one vector containing the $2 \times d$-dimensional variational parameters for the latent space distribution. By default, we set $d$ to 128. We analyze the sensitivity of our method to the dimensionality of the latent space representation in Sec. VIII-A3.
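The haptic encoder's use of five stride-2 causal convolutions on a 32-reading window can be illustrated with a single-channel sketch. The kernel size of 2, the ReLU nonlinearity, and the function names are assumptions for illustration; the actual encoder operates on six F/T channels with learned filters.

```python
def causal_conv1d(signal, kernel, stride=2):
    """1D causal convolution: the output at time t only sees inputs at times <= t."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(signal)  # pad on the left (past) side only
    out = []
    for t in range(stride - 1, len(signal), stride):
        window = padded[t:t + k]  # covers signal[t - k + 1 .. t]
        out.append(sum(w * x for w, x in zip(kernel, window)))
    return out


def force_encoder(signal, kernels):
    """Apply one stride-2 causal convolution per kernel, with ReLU in between."""
    h = list(signal)
    for kernel in kernels:
        h = [max(0.0, v) for v in causal_conv1d(h, kernel, stride=2)]
    return h
```

With stride 2, each layer halves the temporal length (32, 16, 8, 4, 2, 1), so five layers collapse the whole window into a single feature per channel.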
IV-C Multimodal Fusion
Following [69], we combine the estimated distributions of each modality using product of experts. We assume that each modality is conditionally independent given the latent variable representation and that each encoder maps to a multivariate isotropic Gaussian. With these assumptions, we can combine the modality-specific distributions by taking the normalized product of each Gaussian probability density function. The resulting multivariate Gaussian distribution of the multimodal latent space will have mean $\mu_j$ and variance $\sigma_j^2$ computed as:
$$\sigma_j^2 = \left(\sum_{i=1}^{M} \frac{1}{\sigma_{ij}^2}\right)^{-1}, \qquad \mu_j = \sigma_j^2 \sum_{i=1}^{M} \frac{\mu_{ij}}{\sigma_{ij}^2},$$
where $M$ is the number of modalities, $\sigma_j^2$ and $\mu_j$ are the variational parameters of the $j$-th dimension of the encoder’s posterior distribution, $\sigma_{ij}^2$ is the variance and $\mu_{ij}$ is the mean of the $j$-th dimension of the posterior distribution of the $i$-th modality. When training the representation model, we sample from the distribution represented by these variational parameters.
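The product-of-experts fusion of independent Gaussians amounts to summing precisions and taking a precision-weighted mean. The following sketch implements that rule per latent dimension; the function name and nested-list layout (`means[i][j]` for modality `i`, dimension `j`) are illustrative assumptions.

```python
def product_of_experts(means, variances):
    """Fuse per-modality diagonal Gaussians into one Gaussian.

    means[i][j] and variances[i][j] belong to modality i, latent dimension j.
    Returns (fused_means, fused_variances), each a list over latent dimensions.
    """
    dims = len(means[0])
    fused_mu, fused_var = [], []
    for j in range(dims):
        # Precisions (inverse variances) of independent Gaussian experts add up.
        precision = sum(1.0 / var_i[j] for var_i in variances)
        var_j = 1.0 / precision
        # The fused mean is the precision-weighted average of the expert means.
        mu_j = var_j * sum(mu_i[j] / var_i[j] for mu_i, var_i in zip(means, variances))
        fused_mu.append(mu_j)
        fused_var.append(var_j)
    return fused_mu, fused_var
```

A modality that reports a small variance in some dimension dominates the fused mean there, while an uncertain modality contributes little.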
IV-D Self-Supervised Predictions and Decoder Architecture
Representations that encode dynamics and action-related information have been shown to work well for policy learning [40]. To achieve this, we design four action-conditional representation learning objectives. Given the next robot action and the compact representation of the current sensory data, the model has to predict (i) the optical flow in the image sequence generated by the action, (ii) the optical flow mask which is also used in (i), (iii) whether the end-effector will make contact with the environment in the next control cycle, and (iv) the future end-effector position. Ground-truth optical flow annotations are automatically generated given proprioception and known robot kinematics and geometry [23, 27]. From the ground-truth optical flow annotations we also extract the optical flow mask, which can be seen as the segmentation mask of the robot in motion. Ground-truth annotations of binary contact states are generated by applying simple heuristics which check whether the F/T readings on the wrist sensor are above certain empirically determined thresholds.
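The threshold heuristic for contact labels can be sketched as follows. The use of force and torque norms and the threshold values are hypothetical; the text only states that empirically determined thresholds are applied to the wrist F/T readings.

```python
import math


def contact_label(ft_reading, force_threshold=1.0, torque_threshold=0.1):
    """Return 1 if a six-axis F/T reading suggests contact, else 0.

    The thresholds are placeholder values, not the ones used in the paper.
    """
    fx, fy, fz, tx, ty, tz = ft_reading
    force_norm = math.sqrt(fx * fx + fy * fy + fz * fz)
    torque_norm = math.sqrt(tx * tx + ty * ty + tz * tz)
    return int(force_norm > force_threshold or torque_norm > torque_threshold)


def next_step_contact_labels(ft_readings):
    """Label for step t is the contact state at step t + 1 (the next control cycle)."""
    return [contact_label(r) for r in ft_readings[1:]]
```

Because the label for each step comes from the following reading, a trajectory of N readings yields N - 1 training labels without any manual annotation.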
The next action, i.e. the end-effector motion, is encoded by a 2-layer MLP. The output of the action encoder is concatenated with the multimodal representation and processed by an additional 2-layer MLP, which is used as the input to the decoders. As an additional source of self-supervision not included in [39], we are also predicting action-conditional end-effector positions, which can be non-trivial to model due to errors in our dynamics model, our spline-based trajectory generator (discussed in Sec. VI), and non-linear contact dynamics.
The flow predictor uses a 4-layer convolutional decoder with upsampling to process the action-conditional feature vector. Following [23], we use 4 skip connections. At the end of these 4 layers, one convolution layer predicts the unmasked optical flow of the scene and another convolution layer predicts the optical flow mask. These two estimates are multiplied element-wise to predict the optical flow of the robot. The predicted optical flow image is then upsampled to the size of the ground-truth optical flow. The contact predictor is a 1-layer MLP and performs binary classification. The end-effector prediction network is a 4-layer MLP.
As discussed in Sec. II-C, there is concurrency between the different sensory streams leading to correlations and redundancy, e.g., seeing the peg, touching the box, and feeling the force. We exploit this by introducing a fifth representation learning objective that predicts whether two sensor streams are temporally aligned [48]. During training, we sample a mix of time-aligned multimodal data and randomly shifted samples whose contact state is opposite to that of the time-aligned data. The alignment predictor (a 1-layer MLP) takes the low-dimensional multimodal representation as input and performs binary classification of whether the input was aligned or not.
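One way to generate training examples for the alignment objective is sketched below. The trajectory format, the 50/50 mixing ratio, and the function name are assumptions; the text only specifies that misaligned samples are randomly shifted and have the opposite contact state.

```python
import random


def sample_pairing_example(trajectory, contact, rng, p_aligned=0.5):
    """Return (image, force, label) with label 1 for aligned, 0 for misaligned.

    trajectory: list of (image, force) tuples, one per time step.
    contact: list of 0/1 contact states, one per time step.
    """
    t = rng.randrange(len(trajectory))
    image, force = trajectory[t]
    # Candidate time steps whose contact state differs from that at step t.
    opposite = [s for s in range(len(trajectory)) if contact[s] != contact[t]]
    if rng.random() < p_aligned or not opposite:
        return image, force, 1
    s = rng.choice(opposite)
    return image, trajectory[s][1], 0
```

Restricting misaligned samples to the opposite contact state makes the negative examples clearly distinguishable: an image of free-space motion is never paired with another free-space force reading that would look plausibly aligned.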
IV-E Loss Functions and Training Details
For binary classification prediction tasks, we model the likelihood distribution as a Bernoulli distribution. This allows us to use a cross-entropy loss to minimize the negative log-likelihood (Eqn. 3) in the negative ELBO. For predictions with continuous values, we model the likelihood distribution with multivariate Gaussians, and use mean squared error loss functions.
We train the action-conditional optical flow with endpoint error (EPE) loss averaged over all pixels [23], end-effector position prediction with mean squared error loss, and the contact prediction, the alignment prediction, and optical flow mask prediction with cross-entropy loss. Along with minimizing the negative log-likelihood with these five losses (Eqn. 3), we also want to minimize the KL divergence (Eqn. 4) between the approximate posterior and the prior. This gives six loss terms.
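The three kinds of prediction loss named above can be written compactly. This is a minimal sketch on plain Python lists; the flattened list-of-`(u, v)` flow format and the clipping constant `eps` are assumptions made for illustration.

```python
import math


def endpoint_error(pred_flow, true_flow):
    """Average Euclidean distance between predicted and true (u, v) flow per pixel."""
    total = sum(
        math.hypot(pu - tu, pv - tv)
        for (pu, pv), (tu, tv) in zip(pred_flow, true_flow)
    )
    return total / len(pred_flow)


def binary_cross_entropy(prob, label, eps=1e-7):
    """Negative Bernoulli log-likelihood of a 0/1 label under predicted probability."""
    p = min(max(prob, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def mean_squared_error(pred, target):
    """Mean squared error, e.g. for end-effector position prediction."""
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(pred)
```

The total training objective then adds the KL regularizer to the sum of these prediction losses; any relative weighting between the terms is not specified in the text.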
During training, we minimize the sum of the six losses described above end-to-end with stochastic gradient descent on a dataset of rolled-out trajectories. In order to backpropagate through the random variables in the proposed probabilistic network, we employ the reparametrization trick commonly used with variational inference methods [34]. Once trained, this network produces a $d$-dimensional feature vector that compactly represents multimodal data. This vector is used as the input to the manipulation policy learned via reinforcement learning. Our data collection procedure is described in more detail in Sec. VII.
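The reparametrization trick expresses a latent sample as a deterministic function of the variational parameters plus independent noise, so gradients can flow to the mean and variance. The sketch below shows only the sampling step, without automatic differentiation; the function name and list representation are assumptions.

```python
import math
import random


def sample_latent(mu, var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), independently per dimension."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]
```

In a deep learning framework the same expression would be applied to tensors, with the noise treated as a constant input so that the sample remains differentiable with respect to the mean and variance.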
V Baseline Representation Models
In addition to our variational self-supervised multimodal representation model (referred to as Full Model), we also propose two representation learning baselines for comparison.
V-A Deterministic Model
The Deterministic model is based on the model proposed in our previous work [39], which does not use a probabilistic graphical model framework. Instead, we use deterministic encoders to learn the representation and deterministic decoders to predict the same self-supervised objectives. Each modality encoder outputs a fixed-size feature vector. We concatenate the four feature vectors from each modality, and pass the resulting vector through a 2-layer MLP to produce the final $d$-dimensional multimodal representation. The decoder architectures remain the same as the Full Model, and we use the same self-supervised prediction objectives (and corresponding losses).
V-B Reconstruction Model
The Reconstruction model does not use our proposed self-supervised learning objectives, but instead uses unsupervised learning to reconstruct the input modalities. In other words, it is a variational autoencoder [34] using the same product of experts [69] assumption and associated equations to combine the output parameters from the individual modality encoders. The model is trained to reconstruct the RGB image (vision), force reading (force) and end-effector pose and velocities (proprioception) inputted into the model. The loss function used to measure the error in the reconstruction for all three modalities is the mean squared error between the estimated values and the ground-truth values.
The vision reconstruction decoder consists of 6 2D convolutional layers with upsampling, with the final hidden layer transformed with sigmoid activation. The proprioception decoder is composed of a 4-layer MLP. While the force modality’s input to the network is a time series of force readings, the force decoder only estimates the force reading at the final timestep, instead of reconstructing the full time series array. Lastly, we use a 4-layer MLP to estimate the 6-dimensional force reading.
VI Policy Learning and Controller Design
Our final goal is to equip a robot with a policy for performing contact-rich manipulation tasks that leverages multimodal feedback. Though it is possible to engineer controllers for specific instances of these tasks [68, 58], this effort is difficult to scale to the large variability of real-world tasks. Therefore, it is desirable to enable a robot to learn control policies through trial-and-error, where the learning process is applicable to a broad range of tasks. In this work, we use a peg insertion task with different geometries as our evaluation task.
Given its recent success in continuous control [42, 55], deep reinforcement learning lends itself well to learning policies that map high-dimensional features to control commands.
Policy Learning. Modeling contact interactions and multi-contact planning still result in complex optimization problems [52, 51, 64] that remain sensitive to inaccurate actuation and state estimation. We formulate contact-rich manipulation as a model-free reinforcement learning problem to investigate its performance when relying on multimodal feedback and when acting under uncertainty (such as uncertain geometry, clearance, and configuration for our peg insertion task). By choosing a model-free approach, we also eliminate the need for an accurate dynamics model, which is typically difficult to obtain in the presence of rich contacts. Specifically, we choose trust-region policy optimization (TRPO), which is a policy gradient method [55]. TRPO imposes a bound on the KL-divergence for each policy update by solving a constrained optimization problem, which prevents the policy from moving too far away from the previous step. The policy network is a 2-layer MLP that takes as input the $d$-dimensional multimodal representation and produces the 3D position displacement and 1D orientation displacement of the robot end-effector. To train the policy efficiently, we freeze the representation model parameters during policy learning, which reduces the number of learnable parameters to a small fraction of the entire model and substantially improves the sample efficiency.
Controller Design. We define the 6-DoF pose of the end-effector as consisting of the end-effector position and the end-effector orientation. Assuming the Euler angle representation of rotation, we can define the end-effector orientation as roll, pitch, and yaw rotations around the fixed unit vectors of the global frame. In this work, we control the 3D position and the roll rotation of the end-effector, but not the yaw rotation or the y-axis pitch rotation.
Our controller takes as input Cartesian end-effector position displacements and roll angle displacements from the policy at 20Hz, and outputs direct torque commands to the robot at 500Hz. The controller architecture can be split into three parts: trajectory generation, impedance control, and operational space control (see Fig. 3). Our policy outputs end-effector control commands instead of joint-space commands, so it does not need to implicitly learn the non-linear and redundant mapping between 7-DoF joint space and 4-DoF end-effector space. We use direct torque control as it gives our robot compliance during contact, which makes the robot safer for itself, its environment, and any nearby human operator. In addition, compliance makes the peg insertion task easier to accomplish under position uncertainty, as the robot can slide on the surface of the box while pushing downwards [30, 53, 19].
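The trajectory generator bridges the low-bandwidth output of the policy (limited by the forward pass of our representation model) and the high-bandwidth torque control of the robot. Given the position and roll displacements from the policy (together with the initial yaw and pitch angles), we construct a 6D end-effector displacement pose $\Delta\mathbf{x}$. With the current 6D end-effector pose $\mathbf{x}_t$, we calculate the desired end-effector pose $\mathbf{x}_{des}$. The trajectory generator interpolates between $\mathbf{x}_t$ and $\mathbf{x}_{des}$ to yield a trajectory of end-effector pose, velocity and acceleration at 500Hz. This forms the input to a PD impedance controller to compute a task space acceleration command: $\mathbf{u} = \mathbf{a}_{des} - k_p(\mathbf{x} - \mathbf{x}_{des}) - k_v(\mathbf{v} - \mathbf{v}_{des})$, where $\mathbf{x}$ and $\mathbf{v}$ are the current end-effector pose and velocity, $\mathbf{x}_{des}$, $\mathbf{v}_{des}$ and $\mathbf{a}_{des}$ are taken from the generated trajectory, and $k_p$ and $k_v$ are manually tuned gains.
By leveraging known kinematic and dynamics models of the robot, we can calculate joint torques from Cartesian space accelerations with the dynamically-consistent operational space formulation [32]. We compute the force at the end-effector with $\mathbf{F} = \Lambda\mathbf{u}$, where $\mathbf{u}$ is the task space acceleration command and $\Lambda$ is the inertial matrix in the end-effector frame that decouples the end-effector motions. Finally, we map from $\mathbf{F}$ to joint torque commands $\boldsymbol{\tau}$ with the end-effector Jacobian $J$, which is a function of joint angle $\mathbf{q}$: $\boldsymbol{\tau} = J^{T}(\mathbf{q})\,\mathbf{F}$.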
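The three controller stages described above (trajectory generation, PD impedance control, and operational space control) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear interpolation scheme, the PD law in its standard textbook form, the gain values, and the placeholder inertia matrix and Jacobian are all hypothetical, and interpolating Euler angles linearly is a simplification.

```python
import numpy as np


def interpolate(x_t, x_des, n_steps):
    """Linearly interpolate poses from x_t to x_des (hypothetical generator)."""
    alphas = np.linspace(0.0, 1.0, n_steps + 1)[1:]
    return [(1.0 - a) * x_t + a * x_des for a in alphas]


def pd_impedance(x, v, x_des, v_des, a_des, k_p, k_v):
    """Task-space acceleration command from a PD law (assumed standard form)."""
    return a_des - k_p * (x - x_des) - k_v * (v - v_des)


def operational_space_torque(u, Lambda, J):
    """Map acceleration u to joint torques: F = Lambda u, tau = J^T F."""
    F = Lambda @ u
    return J.T @ F


# Policy runs at 20 Hz, torque control at 500 Hz: 25 control ticks per action.
ticks_per_action = 500 // 20

x_t = np.zeros(6)                                      # current 6D pose
x_des = np.array([0.01, 0.0, -0.02, 0.05, 0.0, 0.0])   # hypothetical target
waypoints = interpolate(x_t, x_des, ticks_per_action)

Lambda = np.eye(6)           # placeholder end-effector inertia matrix
J = np.ones((6, 7)) * 0.1    # placeholder Jacobian for a 7-DoF arm
u = pd_impedance(x_t, np.zeros(6), waypoints[0], np.zeros(6), np.zeros(6),
                 k_p=100.0, k_v=20.0)
tau = operational_space_torque(u, Lambda, J)
```

On a real robot, `Lambda` and `J` would come from the robot's dynamics and kinematics models at the current joint configuration rather than from constants.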
VII Experiments: Design and Setup
The primary goal of our experiments is to examine the effectiveness of the multimodal representations in contact-rich manipulation tasks. In particular, we design the experiments to answer the following five questions: 1) What is the value of using all instead of a subset of modalities? 2) What representation learning loss functions help policy learning? 3) How compact can the latent representation be for policy learning? 4) Is policy learning on the real robot practical with a learned representation? 5) Does the learned representation generalize over task variations and recover from perturbations?
Task Setup. We design a set of peg insertion tasks where task success requires joint reasoning over visual and haptic feedback. We use four different types of pegs and holes fabricated with a 3D printer: square peg, triangular peg, semicircular peg, and hexagonal peg, each with a nominal clearance of around 2mm as shown in Fig. 6.
Robot Environment Setup.
In simulation, we use the Kuka LBR IIWA robot, a 7-DoF torque-controlled robot. In our previous work [39], we used the same robot for real-world experiments. Here, we use the Franka Panda robot (also 7-DoF and torque-controlled) to emphasize that the results reported in [39] are reproducible on different hardware. Four sensor modalities are available both in simulation and on real hardware: proprioception, RGB images and depth maps from an RGB-D camera, and force-torque readings. The proprioceptive input is the end-effector pose together with its linear and angular velocity, computed using forward kinematics. RGB images and depth maps are recorded from a fixed camera pointed at the robot. Input images are down-sampled before being passed to our model. On the real robot, we use the Kinect v2 camera. In simulation, we use CHAI3D [16] for rendering images and robot meshes for rendering depths [27]. The force sensor provides 6-axis feedback on the forces and moments along the x, y, z axes. On the real robot, we mount an OptoForce sensor between the last joint and the peg. In simulation, the contact between the peg and the box is modeled with SAI 2.0 [17], a real-time physics simulator for rigid articulated bodies with high-fidelity contact resolution.
Reward Design. We use the following staged reward function to guide the reinforcement learning algorithm through the different sub-tasks, simplifying the challenge of exploration and improving learning efficiency:
[TABLE]
The reward depends on the peg’s current position relative to the peg hole and on the peg’s current orientation about the z axis relative to the peg hole. The remaining quantities are a constant factor that scales the input to the function, the target peg position, which is defined in terms of the height of the hole, and constant scale factors.
Evaluation Metrics. We report the quantitative performance of the policies using the sum of rewards achieved in an episode, normalized by the highest attainable reward. We also provide statistics on the stages of the peg insertion task that each policy can achieve, and report the percentage of evaluation episodes in the following four categories:
1. completed insertion: peg reaches bottom of the hole;
2. inserted into hole: peg goes into the hole but has not reached the bottom;
3. touched the box: peg only makes contact with the box;
4. failed: peg fails to reach the box.
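The two evaluation metrics above can be computed along the following lines. This is only a sketch: the episode record fields (`touched_box`, `depth`), the tolerance used to decide that the peg reached the bottom, and the example rollouts are hypothetical and are not results from the paper.

```python
from collections import Counter

CATEGORIES = ["completed insertion", "inserted into hole",
              "touched the box", "failed"]


def categorize(touched_box, depth_in_hole, hole_depth, tol=1e-3):
    """Assign one episode to the furthest stage it reached.

    depth_in_hole: how far the peg tip went below the hole opening (>= 0).
    tol: hypothetical tolerance for "reached the bottom".
    """
    if depth_in_hole >= hole_depth - tol:
        return "completed insertion"
    if depth_in_hole > 0.0:
        return "inserted into hole"
    if touched_box:
        return "touched the box"
    return "failed"


def summarize(episodes, hole_depth):
    """Return the percentage of episodes in each category."""
    counts = Counter(categorize(e["touched_box"], e["depth"], hole_depth)
                     for e in episodes)
    n = len(episodes)
    return {c: 100.0 * counts[c] / n for c in CATEGORIES}


def normalized_return(rewards, max_return):
    """Sum of episode rewards divided by the highest attainable return."""
    return sum(rewards) / max_return


episodes = [  # made-up rollouts, not results from the paper
    {"touched_box": True, "depth": 0.020},
    {"touched_box": True, "depth": 0.008},
    {"touched_box": True, "depth": 0.0},
    {"touched_box": False, "depth": 0.0},
]
stats = summarize(episodes, hole_depth=0.020)
```

Each episode is counted once, in the furthest stage it reached, so the four percentages always sum to 100.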
Implementation Details. To train each representation model, we collect a multimodal dataset of 150k states on a real robot and 600k in simulation. We generate the self-supervised annotations offline. While collecting the data, we roll out a random policy as well as a heuristic policy that encourages the peg to make contact with the box. As the policy runs at 20 Hz, collecting 100k data points takes around 90 minutes. The representation models are trained for 20 epochs on a Quadro P5000 GPU before starting policy learning. Details of how we train our representation and policy can be found in Appendix A and Appendix B respectively.
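As a quick sanity check on the stated collection time, assuming one recorded data point per 20 Hz policy step and ignoring episode resets:

```latex
\frac{100{,}000\ \text{samples}}{20\ \text{samples/s}} = 5{,}000\ \text{s} \approx 83\ \text{min}
```

This lower bound is consistent with the reported figure of around 90 minutes once some overhead is allowed for.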
VIII Experiments: Results
We first conduct an ablative study in simulation to investigate the contributions of individual sensory modalities, representation learning techniques, and latent space dimensionality to learning the multimodal representation and manipulation policy. We then apply our full multimodal model to a real robot, and train policies using reinforcement learning for the peg insertion tasks from the learned representations with high sample efficiency.
VIII-A Simulation Experiments
VIII-A1 Multimodal Input Experiments
Four modalities are encoded and fused by our representation model: RGB images, depth images, force readings, and proprioception (see Fig. 2). To investigate the importance of each modality for contact-rich manipulation tasks, we perform an ablative study in simulation where we learn multimodal representations with different combinations of modalities. These learned representations are subsequently fed to the policy networks to train on a task of inserting a square peg. While we use a probabilistic representation during representation learning, we take the mean of the learned representation during policy learning. We randomize the configuration of the box position at the beginning of each episode to enhance the robustness and generalization of the model.
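The distinction between sampling from the probabilistic representation during representation learning and using its mean during policy learning can be sketched as follows. The linear encoder head, the feature size of 32, and the diagonal-Gaussian parameterization with a log-variance output are assumptions for illustration; the actual fusion architecture is described in Sec. IV and is not reproduced here.

```python
import numpy as np


def encode(features, W_mu, W_logvar):
    """Hypothetical linear encoder head producing a diagonal Gaussian."""
    return features @ W_mu, features @ W_logvar


def latent(mu, logvar, rng=None):
    """Sample z with the reparameterization trick, or return the mean.

    rng given  -> stochastic z (representation learning)
    rng absent -> deterministic mean (policy learning)
    """
    if rng is None:
        return mu
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps


rng = np.random.default_rng(0)
features = rng.normal(size=32)               # stand-in for fused features
W_mu = rng.normal(size=(32, 128)) * 0.1
W_logvar = rng.normal(size=(32, 128)) * 0.1

mu, logvar = encode(features, W_mu, W_logvar)
z_train = latent(mu, logvar, rng)   # used while training the representation
z_policy = latent(mu, logvar)       # used as the policy input
```

Feeding the mean to the policy removes sampling noise from the policy's input, so the same observation always maps to the same state vector.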
We illustrate the training curves of the TRPO-trained agents in Fig. 4. We train all policies for 2.0k episodes, each lasting 500 steps, and update the policy networks every four episodes. To evaluate each policy, we choose its best-performing checkpoint logged during training and perform 50 rollouts with it. The results of the evaluation can be seen in Fig. 5.
Our Full Model corresponds to the multimodal representation model introduced in Sec. IV, which takes all four modalities as input. We compare it with three baselines: No RGB masks out the visual input to the network, No Haptics masks out the haptic input, and No Depth masks out the depth input. From Fig. 4 and Fig. 5 we observe that the absence of the RGB, depth, or force modality negatively affects task completion, with No Depth performing the worst. Among these three baselines, the No RGB baseline achieves the highest rewards, suggesting that a combination of visual data from depth and haptic data from the force sensor gives sufficient information for the peg insertion tasks. None of the three baselines reaches the same level of performance as the Full Model, which uses all the modalities.
VIII-A2 Representation Learning Model
Our multimodal representation Full Model, as described in Sec. IV, uses variational encoders to predict action-conditional optical flow, contacts, end-effector pose, and time-aligned sensory pairing. We further investigate the efficacy of our model by comparing it to three baselines: Deterministic Model, which uses deterministic encoders (and is trained without the ELBO loss) as described in Sec. V-A; No Pairing, which is trained without the sensory time-alignment prediction; and Reconstruction Model, as described in Sec. V-B. Similar to the modality input ablation study, these learned representation models are subsequently fed to policies trained to insert a square peg using TRPO. As stated earlier, our Full Model completes insertion 78% of the time. In [39], the Deterministic Model completed insertion at a 76% success rate for a 3-DoF peg insertion task. With the additional orientation control, the Deterministic Model task success rate drops to 24%. This drop in performance demonstrates the challenge of learning an insertion task with both rotation and translation actions, as well as the efficacy of the probabilistic encoder. Recent work comparing deterministic and variational inference approaches has shown that variational inference regularizes learning by enforcing a smooth latent space structure [28]. In our work, we see signs of the Deterministic Model overfitting: compared to the training losses, the test losses for contact prediction and pairing prediction increase by factors of 312.11 and 2,514.26 respectively (see Table I).
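The overfitting argument above rests on the ratio between test and training loss for each prediction objective. The sketch below applies a simple threshold to the ratios listed in Table I for the Full Model and the Deterministic Model. The threshold of 100 is a hypothetical cut-off chosen only for illustration; the paper does not define one.

```python
# Test/train loss ratios copied from Table I (FM and Det columns).
RATIOS = {
    "Full Model": {"Optical Flow": 1.63, "End-Effector Pose": 0.75,
                   "Contact": 13.84, "Pairing": 4.99},
    "Deterministic": {"Optical Flow": 1.79, "End-Effector Pose": 0.91,
                      "Contact": 312.11, "Pairing": 2514.26},
}


def flag_overfitting(ratios, threshold):
    """Return the objectives whose test/train loss ratio exceeds threshold."""
    return sorted(name for name, r in ratios.items() if r > threshold)


THRESHOLD = 100.0  # hypothetical cut-off, chosen only for illustration
flagged = {model: flag_overfitting(r, THRESHOLD)
           for model, r in RATIOS.items()}
```

With this cut-off, only the contact and pairing objectives of the Deterministic Model are flagged, which matches the two objectives singled out in the text.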
We observe that our self-supervised training objectives are important for achieving the Full Model performance, especially the time-alignment pairing loss. Removing it (No Pairing) affects policy learning the most, with insertion rates dropping to 22%. According to [48], deciding whether sensory streams are time-aligned requires the detection of co-occurring patterns across modalities. These patterns provide evidence for a common underlying event, e.g. making or breaking contact. The importance of the pairing loss for task performance suggests that learning these patterns that co-occur between modalities provides a strong learning signal. The policy that uses the Reconstruction Model representation learns at a faster rate and performs more insertions than the Deterministic Model and No Pairing. However, its full insertion rate of 36% is still less than half of the insertion rate of the Full Model. For contact-rich tasks, our action-conditional, self-supervised objectives are easier to learn than the full reconstruction objectives, and are also more suitable for policy learning.
VIII-A3 Representation Dimensions
We evaluate how compact the representation needs to be for contact-rich manipulation tasks by changing the dimensionality of the multimodal representation. We hypothesize that while a more compact representation can make reinforcement learning more tractable, it also captures less information about the state. We test several dimensions: d=16, d=64, d=128 (our Full Model), and d=256. We see that the d=16 model can only fully insert 18% of the time. It also has the lowest training accuracy for contact prediction (95.3%) and the highest end-effector pose prediction loss (1.56E-02) compared to the other representation dimension sizes (see Table I and Appendix A, Table III). This suggests that the model captures too little information about the state for the task. While the Full Model performs the best with 78% full insertion, the performance of the policy drops by more than a third when the representation size increases to d=256. As seen in Table I, d=256 has a lower ratio between test loss and training loss for all the predictions except for the pairing loss, with comparable absolute losses in each category (see Appendix A, Table III). This suggests that the model learned the predictions well with little overfitting. The drop in policy learning performance for d=256 might therefore be due to the larger state space seen by the policy.
VIII-B Real Robot Experiments
In our previous work [39], we evaluated the Deterministic Model with a 3D action space representing Cartesian position displacements. On the physical robot platform we evaluated the policies with round, triangular, and semicircular pegs. In this work, we evaluate our Full Model on the real hardware with triangular and semicircular pegs only, since the circular peg does not require orientation control for insertion. In contrast to simulation, the difficulty of sensor synchronization, variable delays from sensing to control, and complex real-world dynamics introduce additional challenges. We make the task tractable on a real robot by training a shallow neural network controller after freezing the multimodal representation model when it is able to generate action-conditional flows with low endpoint errors (see Fig. 6).
We train the policy networks for 450 episodes, each lasting 1000 steps, roughly 7 hours of wall-clock time. We evaluate each policy for 50 episodes in Fig. 7. The first two bars correspond to the set of experiments where we train a specific representation model and policy for each type of peg. The robot achieves a level of success similar to that in simulation. A common strategy that the robot learns is to reach the box, search for the hole by sliding over the surface, align the peg with the hole, and finally perform insertion. More qualitative behaviors can be found in the supplementary video.
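As a rough check on the stated training time, assuming each of the 1000 steps per episode corresponds to one 20 Hz policy step:

```latex
\frac{450\ \text{episodes} \times 1000\ \text{steps/episode}}{20\ \text{steps/s}} = 22{,}500\ \text{s} = 6.25\ \text{h}
```

The remaining time up to the reported 7 hours is presumably spent on episode resets and other overhead; this is an assumption, not something stated in the text.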
We further examine the potential of transferring the learned policies and representations to two novel shapes previously unseen in representation and/or policy learning: the hexagonal peg and the square peg. For policy transfer, we take the representation model and the policy trained for the triangular peg, and test it with the new pegs. From the 3rd and 4th bars in Fig. 7, we see that the policy achieves over 70% success rate on both pegs without any further policy training on them. A better transfer performance can be achieved by taking the representation model trained on the triangular peg, and training a new policy for the new pegs. As shown in the last two bars in Fig. 7, the resulting performance increases by 8% for the hexagonal peg and by 10% for the square peg. Our transfer learning results indicate that the multimodal representations from visual and haptic feedback generalize well across geometric variations of our contact-rich manipulation tasks.
Finally, we study the robustness of our policy in the presence of sensory noise and external perturbations to the arm by periodically occluding the camera and pushing the robot arm during trajectory roll-out. The policy is able to recover from both the occlusion and perturbations. Qualitative results can be found in our supplementary video on our website: https://sites.google.com/view/visionandtouch.
IX Discussion and Conclusion
We examined the value of learning a joint representation of time-aligned multisensory data for contact-rich manipulation tasks. To enable efficient real robot training, we proposed a novel model to encode heterogeneous sensory inputs into a compact multimodal latent representation. Once trained, the representation remained fixed when being used as input to a shallow neural network policy for reinforcement learning. We trained the representation model with self-supervision, eliminating the need for manual annotation. Our experiments with tight clearance peg insertion tasks indicated that they require the multimodal feedback from both vision (RGB and depth) and touch. We also showed that models trained with our proposed self-supervised action-conditional prediction and time-alignment pairing prediction objective surpass models trained on reconstruction objectives. Our ablation studies show that the pairing prediction objective during representation learning is especially important for policy performance, as the prediction allows our representation to learn the relationship between the sensor modalities. By varying the dimension of the latent space representation, we observed that a larger latent space can better learn the self-supervised objectives. When the latent space is too large, it can adversely affect the policy learning. In other words, the size of the latent space is a trade-off between capturing enough information of the state and keeping the policy state space compact. It would be beneficial to study more principled methods of making this trade-off.
On the real robot, we demonstrated that the multimodal representations transfer well to new task instances of peg insertion. For future work, we plan to extend our method to other contact-rich tasks, which require a full 6-DoF controller of position and orientation. We would also like to explore the value of incorporating richer modalities, such as sound, temperature, and proximity sensors, into our representation learning pipeline, as well as new sources of self-supervision.
X Acknowledgements
This work has been partially supported by JD.com American Technologies Corporation (“JD”) under the SAIL-JD AI Research Initiative and by the Toyota Research Institute (“TRI”). This article solely reflects the opinions and conclusions of its authors and not of JD, any entity associated with JD.com, TRI, or any entity associated with Toyota.
Appendix A Representation Training Details
For representation learning, we used Adam [33] to run stochastic gradient descent over the prediction objectives as described in Sec. IV-D. These were the hyperparameters used during representation learning for both simulation and real robot data.
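For reference, a single Adam update with the hyperparameters listed for all models (batch size 64, learning rate 1.00E-04, Adam Beta1 0.5, Adam Beta2 0.999) can be sketched as follows. The epsilon value and the example gradient are assumptions, since neither is given in the table, and the sketch does not reproduce the actual training loop.

```python
import numpy as np

# Hyperparameters from the "All Models" table.
BATCH_SIZE = 64
LEARNING_RATE = 1e-4
BETA1, BETA2 = 0.5, 0.999
EPS = 1e-8  # assumed; not listed in the table


def adam_step(theta, grad, m, v, t):
    """One Adam update with bias correction; t is the 1-based step count."""
    m = BETA1 * m + (1.0 - BETA1) * grad
    v = BETA2 * v + (1.0 - BETA2) * grad ** 2
    m_hat = m / (1.0 - BETA1 ** t)
    v_hat = v / (1.0 - BETA2 ** t)
    theta = theta - LEARNING_RATE * m_hat / (np.sqrt(v_hat) + EPS)
    return theta, m, v


theta = np.array([1.0, -2.0])
grad = np.array([0.5, -0.25])   # made-up gradient for illustration
theta_new, m, v = adam_step(theta, grad, np.zeros(2), np.zeros(2), t=1)
```

On the first step the bias-corrected update has magnitude close to the learning rate in every coordinate, regardless of the gradient scale.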
These are the representation learning prediction losses on the training and testing datasets after training for 20 epochs. The ratio between the training and testing losses can be seen in Table I.
Appendix B Reinforcement Learning Details
These are the hyperparameters for policy learning using TRPO. We based our implementation of TRPO on [35].
- On the real robot we increased and decreased to these values during the last hour of training as it stabilized the learning.
References
- [1] Fares J Abu-Dakka et al. “Adaptation of manipulation skills in physical contact with the environment to reference force profiles” In Autonomous Robots 39.2 Springer, 2015, pp. 199–217
- [2] Pulkit Agrawal et al. “Learning to poke by poking: Experiential learning of intuitive physics” In Advances in Neural Information Processing Systems, 2016, pp. 5074–5082
- [3] Marcin Andrychowicz et al. “Learning dexterous in-hand manipulation” In arXiv preprint arXiv:1808.00177, 2018
- [4] Mohammad Babaeizadeh et al. “Stochastic variational video prediction” In arXiv preprint arXiv:1710.11252, 2017
- [5] Y. Bekiroglu, R. Detry and D. Kragic “Learning tactile characterizations of object- and pose-specific grasps” In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2011, pp. 1554–1560 DOI: 10.1109/IROS.2011.6094878
- [6] Y. Bekiroglu, D. Song, L. Wang and D. Kragic “A probabilistic framework for task-oriented grasp stability assessment” In 2013 IEEE International Conference on Robotics and Automation, 2013, pp. 3040–3047 DOI: 10.1109/ICRA.2013.6630999
- [7] A Bicchi, M Bergamasco, P Dario and A Fiorillo “Integrated tactile sensing for gripper fingers” In Int. Conf. on Robot Vision and Sensory Control, 1988
- [8] Randolph Blake, Kenith V Sobel and Thomas W James “Neural synergy between kinetic vision and touch” In Psychological Science 15.6 SAGE Publications Sage CA: Los Angeles, CA, 2004, pp. 397–402
