Exploring Convolutional Networks for End-to-End Visual Servoing
Aseem Saxena, Harit Pandya, Gourav Kumar, Ayush Gaud, K. Madhava Krishna

TL;DR
This paper introduces an end-to-end convolutional neural network approach for visual servoing that learns visual features directly from images, enabling robust control in diverse, unstructured environments without prior scene knowledge.
Contribution
It presents a novel deep learning method for visual servoing that eliminates the need for handcrafted features and prior scene information, applicable in real-world scenarios.
Findings
Effective in simulation and real-world quadrotor tests
Robust across indoor and outdoor environments
Handles diverse camera poses without prior scene knowledge
Aseem Saxena1, Harit Pandya1, Gourav Kumar1, Ayush Gaud1 and K. Madhava Krishna1
*Equal contribution.
1 International Institute of Information Technology, Hyderabad, India. {harit.pandya, gourav.kumar}@research.iiit.ac.in, {aseem.bits, ayush.gaud}@gmail.com, [email protected]
Harit Pandya is supported by a TCS Research PhD fellowship.
Abstract
Present image based visual servoing approaches rely on extracting hand crafted visual features from an image. Choosing the right set of features is important as it directly affects the performance of any approach. Motivated by recent breakthroughs in performance of data driven methods on recognition and localization tasks, we aim to learn visual feature representations suitable for servoing tasks in unstructured and unknown environments. In this paper, we present an end-to-end learning based approach for visual servoing in diverse scenes where the knowledge of camera parameters and scene geometry is not available a priori. This is achieved by training a convolutional neural network over color images with synchronised camera poses. Through experiments performed in simulation and on a quadrotor, we demonstrate the efficacy and robustness of our approach for a wide range of camera poses in both indoor as well as outdoor environments.
I Introduction
Visual servoing (VS) refers to the control of robot motion using data from vision sensors. Vision sensor integration enables robotic systems to work outside controlled industrial settings. As a consequence, it has applications in diverse areas such as robotic surgery, autonomous navigation and manipulation for household robots. The objective is to move the robot in Cartesian space from an arbitrary starting pose (location and orientation) to a fixed goal pose. This is achieved by iteratively minimizing the error between the current and goal pose.
Position Based Visual Servoing (PBVS) defines the error in Euclidean space, which results in a simpler control law and a minimal-length trajectory [1]. However, PBVS requires a 3D model of the scene and the camera parameters to be known beforehand, which is a major bottleneck in practical implementations of PBVS methods. Image Based Visual Servoing (IBVS), on the other hand, describes the error function in image space by extracting a set of visual features. An IBVS controller attempts to move the robot in such a way that the visual features attain the desired configuration, i.e. it minimises the error in image space. This requires a mapping from the feature velocity in image space to robot motion in Euclidean space via the analytical computation of the image Jacobian, which leads to various issues such as getting trapped in local minima, exceeding joint limits and so forth [2]. Another issue with IBVS approaches is the extraction of unambiguous features that truly represent the pose information, which is a non-trivial task. Classical IBVS approaches use geometrical primitives such as the locations of points, lines and contours as visual features [1]. However, these methods require accurate feature matching for convergence. Recent IBVS methods consider appearance based features such as pixel intensities [3] and image gradients [4]. These methods do not require an explicit matching step; however, the number of features is typically very large, which results in a smaller convergence domain.
In this paper, we address the following question: is it possible to learn the motion required to attain a desired pose from an initial pose using visual feedback alone? Recent breakthroughs in computer vision suggest that data driven frameworks efficiently learn high-level semantic representations from images, especially when a large number of examples is available [5]. Motivated by recent advances in machine learning, especially deep learning, we aim to learn an optimal set of image representations that estimate the relative transformation required to attain a desired pose. We explore convolutional neural network (CNN) architectures to learn such transformations in an end-to-end paradigm. Unlike other visual servoing approaches, our framework eliminates the need for extraction and tracking of features. Moreover, prior knowledge about the camera intrinsics and the scene's 3D geometry is not required. Experiments show that our model also has a large convergence domain across a variety of synthetic and real world scenarios.
In this paper, we present a Convolutional Neural Network trained for performing visual servoing in diverse environments without knowledge of the underlying scene's geometry. We have trained the network on the publicly available 7-Scenes dataset [6], as this dataset provides large variations across scenes and covers a wide range of camera transformations between frames. We evaluate our network on synthetic 3D models using a free camera paradigm and on a real world scene using a quadrotor. Our simulation based testing framework allows us to compute ground truth camera transformations that can be used to compute the error of our system's estimates. Figure 1 shows an example servoing result. Figure 1(a) shows the scene on which servoing was performed. Figure 1(b-d) shows the initial pose, the desired pose and the resultant pose attained after visual servoing. Note that although there is large camera motion between the initial and desired poses, the camera still reaches close to the desired pose using our method.
II Related Work
Most previous image based visual servoing approaches rely on hand-crafted visual features for representing images. The control law can be seen as gradient descent over the feature error [3]. This requires the image Jacobian, or interaction matrix, to be computed analytically. For several features widely used by modern computer vision techniques, for instance Histogram of Oriented Gradients (HOG), the interaction matrix is difficult to derive analytically. Another line of approaches intends to estimate the interaction matrix numerically. However, due to the high non-linearity of the interaction matrix, it is difficult to obtain an accurate estimate. Numerical methods are also vulnerable to conditioning and singularity issues. Neural network based methods have been used for learning the interaction matrix, but the selection of features was hand-engineered. Readers may refer to [7] for a detailed review of Jacobian learning and estimation methods. Recently, support vector machines have been used to learn pose specific representations for visual servoing across object instances [8]; again, the interaction matrix was numerically estimated. There has been significant work on reinforcement learning (RL) approaches [9], [10] for end-to-end visual servoing. However, parameters learned by RL are specific to the environment and task, hence it is difficult to generalise RL to new environments. On the other hand, our approach is end-to-end, i.e. we learn visuomotor representations for direct control, and it generalises well to unknown environments.
Techniques for pose estimation, camera relocalization and visual odometry have been successfully applied to the visual servoing problem. There have been recent data-driven works on absolute 6D pose estimation of a scene from a single monocular image [11], [12]. Kendall et al. [12] train a CNN to regress the 6D pose of the camera from a single monocular image in real time. Our approach differs from theirs as we wish to learn the relative camera pose from a pair of images. As natural scenes change over time, systems which estimate the absolute pose of a scene are bound to falter, since a viewpoint can look remarkably different from the same viewpoint at a different time. Rather, we consider the relative pose between two frames to be much more meaningful. Two images with sufficient scene overlap offer more information and context than a single image. Some recent works have approached the problem of camera ego-motion estimation, which has applications in visual servoing [13], [14], [15]. Agrawal et al. [13] explore the idea of feature learning using ego-motion as ground truth instead of manually annotated labels. They demonstrate camera ego-motion estimation by learning a Siamese-style CNN with two images as input and the relative camera transformation as the ground truth. However, our work differs significantly from theirs in multiple ways. We perform regression over the image pairs whereas they perform classification. We also use a different network architecture, loss function and optimization scheme for our task. Costante et al. [14] train a network to estimate frame-to-frame visual odometry by taking the optical flow between image pairs as input. Our approach does not require the computation of optical flow. To the best of our knowledge, there has been no work which directly addresses the problem of visual servoing by leveraging powerful CNN based image features.
Contributions: Our contributions can be summarised as follows. Firstly, we present a CNN based learning framework for visual servoing. Our framework generalises well over a wide range of synthetic and real world scenarios. We rigorously and systematically evaluate our approach in simulation and on real scenarios using a quadrotor. Secondly, there are no benchmarking datasets for visual servoing: because the viewpoint is controlled dynamically, it is not feasible to provide pre-captured images as most available datasets do. We will publicly release the 3D models of the scenes used for testing along with the necessary scripts.
III Overview
Assuming an eye-in-hand configuration and a world origin coinciding with the given object's center, we denote the camera's pose in SE(3) at a given time $t$ in a fixed global frame as $c_t$. Given a scene $S$ and a desired camera pose $c^*$ in the same global frame, the goal of a visual servoing scheme is to find a camera transformation $T$ such that $T c_t = c^*$. For image based visual servoing (IBVS), the current pose and the desired pose are represented in the form of sets of features $s_t = f(I_t, K)$ and $s^* = f(I^*, K)$ extracted from the images $I_t$ and $I^*$, where $K$ is the camera's intrinsic matrix and $f$ is the feature selection criterion. For IBVS, the goal is modified to finding the transformation $T$ such that the error in features $e = s_t - s^*$ is regulated to zero at the desired pose. The task is achieved by minimizing $e$ iteratively and controlling the camera velocity $v = -\lambda L^{+} e$, where $\lambda$ is a positive gain, $L$ is the interaction matrix that relates the rate of change of the features to the camera velocity and $^{+}$ represents the pseudoinverse operation as defined in [1]. On the other hand, in position based visual servoing (PBVS), the camera pose at any given time is inferred from the scene and image measurements. However, inferring the camera pose from a single image requires knowledge of the scene and the camera parameters. In this work, we aim to jointly learn the representations $s$ that describe the image and the error $e$ from a pair of images $I_t$ and $I^*$, without the camera parameters $K$ and the geometry of the scene $S$. Our framework takes RGB images $I_t$ and $I^*$ as the input and estimates the camera transformation $T$, thereby waiving the requirement for any feature computation or matching step. Further, using image based feedback, we estimate the control commands to attain the desired camera pose $c^*$. Figure 2 shows the overall pipeline of the proposed framework.
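For concreteness, the classical IBVS law from [1] that the learned approach replaces can be written out for a single normalized image point $(x, y)$ at depth $Z$. This is the standard textbook form rather than anything specific to this paper; the gain $\lambda$ and the hat on the estimated interaction matrix follow the usual convention in [1].

```latex
\begin{aligned}
\mathbf{e} &= \mathbf{s} - \mathbf{s}^{*}, \qquad
\mathbf{v}_c = -\lambda\, \widehat{\mathbf{L}}^{+}\, \mathbf{e}, \\[4pt]
\mathbf{L}_{\mathbf{x}} &=
\begin{bmatrix}
-1/Z & 0 & x/Z & xy & -(1+x^{2}) & y \\
0 & -1/Z & y/Z & 1+y^{2} & -xy & -x
\end{bmatrix}
\end{aligned}
```

The dependence on the depth $Z$ and on calibrated coordinates $(x, y)$ is the kind of scene and camera knowledge that a learned, end-to-end estimate of the relative pose is meant to avoid.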
IV Network Architecture
IV-A FlowNet
Convolutional Neural Networks have recently been shown to perform well on large scale visual recognition tasks [5]. More recently, research on training CNNs for per-pixel prediction tasks such as optical flow has started to surface [16]. FlowNet by Fischer et al. [16] approaches the problem of optical flow in a supervised learning setting. Optical flow prediction involves both per-pixel localization and learning powerful representations. We leverage these aspects of FlowNet to learn camera ego-motion. The motivation behind this effort is that optical flow has traditionally been used to estimate visual odometry [17]. A network which can robustly estimate optical flow should also be able to estimate camera ego-motion, since both problems involve correspondences between image pixels. FlowNet is trained to predict optical flow using image pairs as input and their x-y flow fields as ground truth (Figure 3(a)). The images are stacked together to form a 6-channel image which is passed through multiple convolutions and ReLU non-linearities. Convolutional Neural Networks downscale their feature maps, which is necessary for the training phase to be computationally feasible. As optical flow is a per-pixel prediction task, predicting a flow field of higher resolution requires the coarse feature map to be brought back up to scale. In order to provide dense per-pixel predictions, 'upconvolution' is performed on the coarse feature map. 'Upconvolution' involves unpooling (bilinear upscaling of the feature map) followed by a convolution (refer Figure 3(b)). Similar layers have been used previously [18]. In this way, information from both coarser and finer feature maps is preserved. Upconvolution is performed at multiple scales, which ultimately results in a two-channel feature map that is 16 times the resolution of the last coarse feature map and 1/4 times the resolution of the image input.
Our network differs slightly from FlowNet: we discard the loss layer of FlowNet and instead feed the final feature map to a fully connected layer with ReLU non-linearity and dropout, followed by separate regression layers for translation and rotation [12].
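The resolution ratios quoted above imply that the coarsest feature map sits at 1/64 of the input resolution (1/4 divided by 16). The following sketch only does that bookkeeping and shows the 6-channel stacking on nested lists; the input size of 384 x 512 and the function names are illustrative assumptions, not values taken from the paper.

```python
def flownet_style_resolutions(height, width, coarse_stride=64, output_stride=4):
    """Return (coarse_hw, output_hw) for an input of the given size.

    The strides are assumptions chosen to match the ratios in the text:
    output = 16 x coarsest map = 1/4 x input.
    """
    if height % coarse_stride or width % coarse_stride:
        raise ValueError("input size must be divisible by the coarse stride")
    coarse = (height // coarse_stride, width // coarse_stride)
    output = (height // output_stride, width // output_stride)
    return coarse, output


def stack_channels(image_a, image_b):
    """Stack two H x W x 3 images (nested lists) into one H x W x 6 image."""
    return [[pixel_a + pixel_b for pixel_a, pixel_b in zip(row_a, row_b)]
            for row_a, row_b in zip(image_a, image_b)]


# Hypothetical input size, used only to illustrate the ratios.
coarse, output = flownet_style_resolutions(384, 512)
print(coarse, output)  # (6, 8) (96, 128)
```

In the network described here, the final two-channel map is not used as a flow prediction but is flattened into the fully connected layer that feeds the translation and rotation regressors.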
V Training
V-A Loss function and Optimization Scheme
Our network takes in two monocular images $I_1$, $I_2$ and outputs a pose vector $\mathbf{p}$ comprising a relative ($I_2$ with respect to $I_1$) translation $\mathbf{x}$ and rotation $\mathbf{q}$ in quaternion form.
$$\mathbf{p} = [\mathbf{x}, \mathbf{q}]$$
To regress the relative pose, we consider the following objective loss function, similar to [12], where $\hat{\mathbf{x}}$ and $\hat{\mathbf{q}}$ denote the network's predictions.
$$\mathrm{loss}(I_1, I_2) = \left\| \hat{\mathbf{x}} - \mathbf{x} \right\|_2 + \beta \left\| \hat{\mathbf{q}} - \frac{\mathbf{q}}{\|\mathbf{q}\|} \right\|_2$$
$\beta$ is chosen so as to keep the expected values of the translation and rotation errors equal. We found as to be optimal for training. The motivation for this loss is that the translation and rotation regressors remain loosely coupled and are therefore not denied the information needed to factor position from orientation and vice versa.
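A weighted pose regression loss in the style of [12] can be sketched in a few lines. This is a minimal illustration, assuming Euclidean norms on both terms and a unit-normalized ground-truth quaternion as in [12]; the weight `beta=10.0` in the usage line is a placeholder and not the value used by the authors.

```python
import math


def pose_loss(x_pred, q_pred, x_true, q_true, beta):
    """Translation error plus beta-weighted quaternion error (Euclidean norms)."""
    # Normalize the ground-truth quaternion so only its direction matters.
    q_norm = math.sqrt(sum(c * c for c in q_true))
    q_unit = [c / q_norm for c in q_true]
    trans_err = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_pred, x_true)))
    rot_err = math.sqrt(sum((a - b) ** 2 for a, b in zip(q_pred, q_unit)))
    return trans_err + beta * rot_err


# Perfect rotation, 0.5 units of translation error: the loss is about 0.5.
example = pose_loss([0.3, 0.0, 0.4], [1, 0, 0, 0], [0, 0, 0], [2, 0, 0, 0], beta=10.0)
```

Because the two terms are summed with a single scalar weight, the balance between translation and rotation accuracy is set entirely by that weight, which is why it has to be tuned to the scale of the scenes.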
V-B Data-Preparation and Implementation Details
We use the train split of the 7-Scenes dataset to train our networks. Ground truth is present for each frame in the form of a homogeneous matrix. For an image frame in a sequence, we take only 10 temporally close frames for computing the ground truth transformation; this is done to ensure that there is partial scene overlap between the two images. Let the absolute poses in homogeneous coordinates of $I_1$ and $I_2$ be $T_{O1}$ and $T_{O2}$ (with respect to some world origin $O$) respectively; then the transformation of $I_2$ with respect to $I_1$ is given by:
$$T_{12} = T_{O1}^{-1}\, T_{O2}$$
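The relative ground-truth transformation between two frames can be computed from their absolute poses with plain 4 x 4 matrix arithmetic. The sketch below assumes each absolute pose is a camera-to-world rigid transform stored as a nested list; the function names are hypothetical and the direction convention (second camera expressed in the first camera's frame) is an assumption.

```python
def invert_rigid(T):
    """Invert a 4x4 rigid transform [[R, t], [0, 1]] as [[R^T, -R^T t], [0, 1]]."""
    R_t = [[T[j][i] for j in range(3)] for i in range(3)]  # transpose of R
    t = [T[i][3] for i in range(3)]
    t_inv = [-sum(R_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [R_t[i] + [t_inv[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def matmul4(A, B):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def relative_pose(T_first, T_second):
    """Pose of the second camera expressed in the first camera's frame."""
    return matmul4(invert_rigid(T_first), T_second)


# First camera rotated 90 degrees about z and placed at (1, 0, 0);
# second camera unrotated at (1, 1, 0).
T_first = [[0.0, -1.0, 0.0, 1.0], [1.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
T_second = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 1.0],
            [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
T_rel = relative_pose(T_first, T_second)
```

Composing `T_first` with `T_rel` recovers `T_second`, which is a quick sanity check on the chosen convention before generating training labels.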
We obtain approximately 500,000 such training image pairs. For training on the FlowNet architecture, we resize the images to and pass them to the network. We use FlowNet's mean subtraction layer to normalize the image data. We use the Caffe library [19] to train our networks. The machine has a Core i7 processor with 64 GB of RAM, and we used a single Titan X GPU to perform training and testing. It took an hour to complete 1000 iterations with a batch size of 40. We perform transfer learning [20], [21] with the official FlowNet model weights released by the authors, in order to get a better network initialization and faster convergence. We use the Adam optimization scheme instead of stochastic gradient descent for minimizing the loss function, as it showed faster convergence during our experiments. We train with base learning rate as reduced by every iterations. We take momentum, momentum2 parameters of Adam solver to be and respectively. We use the network weights obtained after iterations of training for all our experiments.
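The pairing rule described earlier in this subsection (each frame combined with 10 temporally close frames of the same sequence) can be sketched as an index generator. This is only one reading of "temporally close": it assumes the 10 frames that follow each frame, whereas the original pipeline may also have used preceding frames. The sequence length of 1000 in the usage line is a hypothetical example.

```python
def make_training_pairs(num_frames, window=10):
    """Pair each frame index with up to `window` following frame indices."""
    pairs = []
    for i in range(num_frames):
        last = min(i + window, num_frames - 1)
        for j in range(i + 1, last + 1):
            pairs.append((i, j))
    return pairs


# A hypothetical 1000-frame sequence yields 9945 pairs under this rule.
pairs = make_training_pairs(1000)
print(len(pairs))  # 9945
```

Restricting pairs to a short temporal window is what keeps the two images overlapping; widening the window would add larger relative motions at the cost of pairs with little shared content.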
V-C Dataset
We train our network on the RGB-D '7-Scenes' dataset [6]. It comprises seven scenes of varying spatial extent and clutter, as shown in Figure 4. The dataset is challenging due to the presence of image artifacts such as motion blur and reflections. The presence of texture-less flat surfaces, sensor noise and varying lighting conditions compounds the challenge. We chose this dataset as it comprises multiple trajectories with a variety of rotational and translational transformations between frames. This enables us to learn a rich variety of poses from challenging image pairs.
VI Control Law
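The network is trained to compute the relative error in pose given two images $I_t$ and $I^*$. We consider an object centric coordinate system with a frame $\mathcal{F}_o$ attached to the object; $\mathcal{F}_c$ and $\mathcal{F}_{c^*}$ denote the current and desired camera frames. In our PBVS scheme, $s = ({}^{c^*}t_c, \theta u)$. Consequently, $s^* = 0$ and $e = s$. This formulation results in a decoupling of rotational and translational motions and a simple control law as follows:
$$v_c = -\lambda\, R^{\top}\, {}^{c^*}t_c, \qquad \omega_c = -\lambda\, \theta u$$
where $R$ and ${}^{c^*}t_c$ are the relative rotation and translation of the camera's desired pose with respect to the camera's initial pose, expressed in frame $\mathcal{F}_{c^*}$, and $\theta u$ is the angle-axis form of $R$. $R$ and ${}^{c^*}t_c$ are predicted by our network, given $I_t$ and $I^*$. $\lambda$ is the step size for rotational and translational velocities.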
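A decoupled PBVS update of the kind described in [1] can be sketched as follows, assuming the network outputs a relative translation `t` and a quaternion `q` in (w, x, y, z) order. The quaternion ordering, the function names and the gain value are assumptions made for illustration; the code shows the generic scheme from [1] (linear velocity from the rotated translation, angular velocity from the angle-axis vector), not the authors' implementation.

```python
import math


def quat_to_matrix(q):
    """Rotation matrix for a quaternion q = (w, x, y, z); q is normalized first."""
    n = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / n for c in q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def quat_to_theta_u(q):
    """Angle-axis vector theta * u for q = (w, x, y, z), with theta in [0, pi]."""
    n = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / n for c in q)
    if w < 0:  # q and -q encode the same rotation; keep the shorter one
        w, x, y, z = -w, -x, -y, -z
    s = math.sqrt(x * x + y * y + z * z)
    if s < 1e-12:
        return [0.0, 0.0, 0.0]
    theta = 2.0 * math.atan2(s, w)
    return [theta * x / s, theta * y / s, theta * z / s]


def pbvs_velocity(t, q, lam):
    """Return (v, omega) with v = -lam * R^T t and omega = -lam * theta * u."""
    R = quat_to_matrix(q)
    v = [-lam * sum(R[k][i] * t[k] for k in range(3)) for i in range(3)]
    omega = [-lam * c for c in quat_to_theta_u(q)]
    return v, omega


# Pure translation error: the angular command is zero.
v, omega = pbvs_velocity([0.2, 0.0, -0.1], (1.0, 0.0, 0.0, 0.0), lam=0.5)
```

Both commands are proportional to the predicted error, so they shrink as the camera approaches the goal and vanish when the predicted relative pose is the identity.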
VII Experiments and Results
We evaluate our approach on non-planar scenes. Since there is no publicly available dataset that allows us to render an entire scene from a given viewpoint, we first introduce a new synthetic dataset, VSSD, consisting of detailed CAD models of various scenes. Secondly, we use the OpenRAVE simulation framework [22] for rendering scenes, since it allows us to quantitatively measure the performance of our approach as the desired camera pose is known in the world frame. Thirdly, we qualitatively evaluate the performance of our approach on our dataset for various initial and desired poses. Finally, we execute our approach in a real environment using a quadrotor. Note that for all the experiments we do not assume any knowledge of camera parameters or depth information of the scene. It is also worth noting that the images used in evaluation were not encountered during training of the CNN model. For simulation experiments we consider a free-flying camera model. All the experiments reported here are performed on a system with an Intel i7 processor, 64 GB RAM and a single 12 GB Titan X GPU. On this system our approach takes s for initialization i.e. loading the network into GPU and henceforth every iteration takes ms of which, majority of the time is consumed in forward pass of the network.
VII-A Visual Servoing Scene Dataset
There are several publicly available datasets for tracking and localization [23], [6]. However, for visual servoing it is difficult to release such a dataset, as it requires image based measurements of an environment in which the viewpoint changes dynamically. We address this issue by using synthetic 3D models. In the recent past, 3D models have been used by the computer graphics and vision communities to produce large amounts of synthetic data, which enable better generalisation for deep learning models [24]. However, these datasets are limited to shapes and objects. Recently Handa et al. [25] released a synthetic dataset for scenes. However, the scenes provided are purely depth-based, which makes them unsuitable for visual servoing purposes. D positioning is performed for physical objects which limits the scope of reproducing the results for benchmarking purposes.
For this work, we have generated the Visual Servoing Scene Dataset (VSSD) by rendering scenes using textured synthetic 3D models publicly available from Google 3D Warehouse [26]. We have diversified the scenes based on the following criteria:
- We have selected models that represent indoor, outdoor and object categories.
- The scenes are sufficiently large to capture large camera transformations.
- These scenes have non-homogeneous lighting conditions.
- Viewpoints in the scenes vary in texture.
The main motivation behind this effort is to provide a wide range of test cases for systematically evaluating visual servoing approaches on a common benchmark. All the scenes (CAD models) used in the dataset are publicly available and can be downloaded from our project page (http://robotics.iiit.ac.in/urls/d173716a.htm).
VII-A1 Simulation results for 3D scene
In this experiment we aim to evaluate the control law for our network architecture and its robustness in performing a positioning task. The initial pose (refer Figure 6(a)) is selected from the "House" model of the VSSD dataset. The difference between desired and initial pose . Although the relative camera transformation is large, our approach is still able to converge to the desired pose with error in camera pose as , which is around in both translation as well as rotation. It can be seen from Figure 6(e-g) that both the error and the camera velocity decrease exponentially despite the fact that these are output by a CNN. The experiment demonstrates that visuomotor representations are indeed learnt by our system. Also, the camera trajectory shown in Figure 6(h) is close to a straight line, which is desirable for visual servoing purposes.
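The exponential decrease is what a proportional law would produce if the predicted relative pose were exact. As a sketch under that idealized assumption, with a constant gain $\lambda$ (a symbol introduced here for illustration), the error dynamics reduce to a first-order linear system:

```latex
\dot{\mathbf{e}}(t) = -\lambda\,\mathbf{e}(t)
\quad\Longrightarrow\quad
\mathbf{e}(t) = \mathbf{e}(0)\,e^{-\lambda t},
\qquad
\|\mathbf{v}(t)\| \;\propto\; \lambda\,\|\mathbf{e}(0)\|\,e^{-\lambda t}.
```

Observing curves of this shape with network-predicted poses suggests that the predictions stay consistent along the trajectory, even though the derivation itself assumes perfect pose estimates.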
VII-A2 Qualitative results on servoing dataset
The objective of this experiment is to show the efficacy of the proposed algorithm in servoing to a diverse set of target instances across various environment and viewpoint variations. For every scene from the VSSD dataset, we evaluate our algorithm for two configurations of the initial and desired pose pair, with different transformations in DOF. The resultant error images in Figure 5 indicate that our CNN based approach is indeed able to attain the desired pose for large camera pose variations. Note that VSSD has non-homogeneous lighting conditions, hence the assumption of temporal luminance continuity made by previous featureless visual servoing approaches [3], [4] does not apply to such scenes. Also, the scene "kitchen" has textureless surfaces, which would make feature extraction difficult. This experiment validates the robustness of the feature representations learnt by the network for diverse and challenging environments without prior knowledge of the scene or the camera used.
VII-A3 Real experiment using a quadrotor
In this experiment, we evaluate our approach on real world scenarios using a Parrot Bebop 2 drone. Since quadrotors are under-actuated, only DOF tasks were selected for visual servoing. In the real world, it is difficult to accurately measure the position of a drone. Hence we report qualitative results and an approximate trajectory reported by the drone, generated by fusing its inertial measurement unit (IMU), sonar sensor and downward-facing optical flow sensor. Note that the images in the evaluation were not encountered during training of the CNN model. Again, the transformation between the initial and the desired pose is large. Precise convergence was not achieved since only DOF could be controlled. Figure 7(a,b) shows the initial and desired poses given to the CNN for generating velocity commands. The local controller aimed to track the quadrotor velocity commands generated by the CNN based high-level controller. The CNN forward pass was performed on a laptop computer with a Core i7 CPU, an Nvidia Quadro M2000M GPU and 16 GB RAM. It took ms for one forward pass to complete on the machine. The drone was given 2 seconds to converge to the commanded velocity before the next image was captured and passed through the network to produce the next velocity command. The images captured by the drone and the corresponding control commands generated by the network were exchanged between the system and the drone over a WiFi channel.
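The capture, predict, command and wait cycle described above can be sketched as a small loop with injected callables. Everything here is a hypothetical stand-in: the callable names, the gain, the stopping tolerance and the choice of controlled axes (three linear velocities plus yaw rate, a common quadrotor command set) are assumptions for illustration, and only the 2-second settle time comes from the text.

```python
import time


def servo_loop(capture_image, predict_pose_error, send_velocity, goal_image,
               gain=0.5, settle_seconds=2.0, max_steps=50, tolerance=0.05,
               sleep=time.sleep):
    """Run a capture -> predict -> command loop until the predicted error is small.

    Hypothetical interfaces:
      capture_image()                  -> current image
      predict_pose_error(cur, goal)    -> (tx, ty, tz, yaw) offset from the goal
      send_velocity(vx, vy, vz, wyaw)  -> forwards one command to the drone
    Returns the number of non-zero commands sent before stopping.
    """
    for step in range(max_steps):
        current = capture_image()
        tx, ty, tz, yaw = predict_pose_error(current, goal_image)
        if max(abs(tx), abs(ty), abs(tz), abs(yaw)) < tolerance:
            send_velocity(0.0, 0.0, 0.0, 0.0)  # close enough: stop the drone
            return step
        # Proportional command that drives the predicted offset towards zero.
        send_velocity(-gain * tx, -gain * ty, -gain * tz, -gain * yaw)
        sleep(settle_seconds)  # let the low-level controller track the command
    send_velocity(0.0, 0.0, 0.0, 0.0)
    return max_steps
```

Passing `sleep` as a parameter makes the loop easy to exercise against a simulated drone without waiting in real time.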
VIII Conclusion
In this work, we have introduced an end-to-end learning based framework for visual servoing tasks using a CNN. The visuomotor representations learnt by the network generalise well across diverse environments. We have experimentally verified the robustness of our approach to non-homogeneous illumination and scene texture in both synthetic and real world scenarios. Unlike previous approaches, we do not need knowledge of the scene geometry or the camera parameters. Further, by learning the control representations we circumvent the requirement for any feature extraction or tracking step.
References
[1] F. Chaumette and S. Hutchinson, "Visual servo control. I. Basic approaches," IEEE RAM, vol. 13, no. 4, pp. 82–90, 2006.
[2] F. Chaumette, "Potential problems of stability and convergence in image-based and position-based visual servoing," in The Confluence of Vision and Control. Springer, 1998, pp. 66–78.
[3] C. Collewet and E. Marchand, "Photometric visual servoing," IEEE TRO, vol. 27, no. 4, pp. 828–834, 2011.
[4] E. Marchand and C. Collewet, "Using image gradient as a visual feature for visual servoing," in IROS, 2010, pp. 5687–5692.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012, pp. 1097–1105.
[6] B. Glocker, S. Izadi, J. Shotton, and A. Criminisi, "Real-time RGB-D camera relocalization," in ISMAR, 2013, pp. 173–179.
[7] F. Chaumette and S. Hutchinson, "Visual servo control, part II: Advanced approaches," IEEE RAM, vol. 14, no. 1, pp. 109–118, 2007.
[8] H. Pandya, K. M. Krishna, and C. Jawahar, "Servoing across object instances: Visual servoing for object category," in ICRA, 2015, pp. 6011–6018.
