Video Object Segmentation-based Visual Servo Control and Object Depth Estimation on a Mobile Robot

Brent A. Griffin; Victoria Florence; Jason J. Corso

arXiv:1903.08336·cs.RO·January 13, 2020


TL;DR

This paper presents a novel approach combining video object segmentation with visual servo control and depth estimation on a mobile robot, enabling real-time object identification, localization, and grasping without camera calibration.

Contribution

It introduces a segmentation-based visual servo control method and a Hadamard-Broyden update formulation for quick learning of actuator-visual feature relationships without calibration.

Findings

01 Successfully identified and grasped objects from the YCB dataset.

02 Tracked people and articulated objects in real time.

03 Validated on a mobile HSR robot with various configurations.


Tables

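Table 1: VOS-VS Hadamard-Broyden Update Configurations. $\widehat{\mathbf{J}_{\mathbf{s}}^{+}}$ values are learned online using our HB update formulation (11), enabling HSR to automatically learn the relationship between actuators and visual features without any camera calibration.

| Config. $\mathbf{H}$ (11) | Camera | Joint coupled to $s_{x}$ | Learned $\frac{\partial q_{i}}{\partial s_{x}}$ in $\widehat{\mathbf{J}_{\mathbf{s}}^{+}}$ (13) | Joint coupled to $s_{y}$ | Learned $\frac{\partial q_{i}}{\partial s_{y}}$ in $\widehat{\mathbf{J}_{\mathbf{s}}^{+}}$ (13) |
| --- | --- | --- | --- | --- | --- |
| $\mathbf{H}_{\text{head}}$ | Head | $q_{\text{head pan}}$ | 0.00173 | $q_{\text{head tilt}}$ | 0.00183 |
| $\mathbf{H}_{\text{arm lift}}$ | Grasp | $q_{\text{arm lift}}$ | -0.00157 | $q_{\text{arm roll}}$ | 0.00321 |
| $\mathbf{H}_{\text{arm wrist}}$ | Grasp | $q_{\text{wrist flex}}$ | -0.00221 | $q_{\text{arm roll}}$ | 0.00445 |
| $\mathbf{H}_{\text{arm both}}$ | Grasp | $q_{\text{arm lift}}$ | -0.00036 | $q_{\text{arm roll}}$ | 0.00328 |
| | | $q_{\text{wrist flex}}$ | -0.00392 | | |
| $\mathbf{H}_{\text{base}}$ | Grasp | $q_{\text{base forward}}$ | -0.00179 | $q_{\text{base lateral}}$ | 0.00173 |
| $\mathbf{H}_{\text{base grasp}}$ | Grasp | $q_{\text{base forward}}$ | -0.00040 | $q_{\text{base lateral}}$ | 0.00040 |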

Table 2: Consecutive Mobile Robot Trial Results. All results are from a single consecutive set of mobile HSR trials. Across all of the challenge objects, VOS-VS has an 83% success rate. Except for one VOS-DE trial, the food objects were a complete success.

| Item | Object Category | Support Height (m) | VS Success | DE Success |
| --- | --- | --- | --- | --- |
| Chips Can | Food | 0.25 | X | X |
| Potted Meat | Food | 0.125 | X | X |
| Plastic Banana | Food | Ground | X | X |
| Box of Sugar | Food | 0.25 | X | X |
| Tuna | Food | 0.125 | X | |
| Gelatin | Food | Ground | X | X |
| Mug | Kitchen | 0.25 | X | X |
| Softscrub | Kitchen | 0.125 | | N/A |
| Skillet with Lid | Kitchen | Ground | | N/A |
| Plate | Kitchen | 0.25 | X | X |
| Spatula | Kitchen | 0.125 | | N/A |
| Knife | Kitchen | Ground | X | |
| Power Drill | Tool | 0.25 | X | X |
| Marker | Tool | 0.125 | X | |
| Padlock | Tool | Ground | X | |
| Wood | Tool | 0.25 | X | |
| Spring Clamp | Tool | 0.125 | X | |
| Screwdriver | Tool | Ground | X | |
| Baseball | Shape | 0.25 | X | |
| Plastic Chain | Shape | 0.125 | X | |
| Washer | Shape | Ground | X | |
| Stacking Cup | Shape | 0.25 | X | X |
| Dice | Shape | 0.125 | | N/A |
| Foam Brick | Shape | Ground | X | X |

Equations

$$\mathbf{q}=\left[q_{\text{head tilt}},\,q_{\text{head pan}},\,\dots,\,q_{\text{base roll}}\right]^{\intercal}.\tag{1}$$

$$M=\text{vos}(I,\mathbf{W}),\tag{2}$$

$$s_{A}:=\sum_{\ell_{p}\in M}\ell_{p},\tag{3}$$

$$s_{x}:=\frac{\sum_{\ell_{p}\in M}\ell_{p}\,p_{x}}{s_{A}},\tag{4}$$

$$s_{y}:=\frac{\sum_{\ell_{p}\in M}\ell_{p}\,p_{y}}{s_{A}},\tag{5}$$

$$\mathbf{e}:=\mathbf{s}(I,\mathbf{W})-\mathbf{s}^{*},\tag{6}$$

$$\dot{\mathbf{s}}=\mathbf{L_{s}}\mathbf{v_{c}},\tag{7}$$

$$\mathbf{v_{c}}=-\lambda\widehat{\mathbf{L}_{\mathbf{s}}^{+}}\mathbf{e},\tag{8}$$

$$\Delta\mathbf{s}=\mathbf{J_{s}}\Delta\mathbf{q},\tag{9}$$

$$\Delta\mathbf{q}=-\lambda\widehat{\mathbf{J}_{\mathbf{s}}^{+}}\mathbf{e},\tag{10}$$

$$\widehat{\mathbf{J}_{\mathbf{s}}^{+}}_{t+1}:=\widehat{\mathbf{J}_{\mathbf{s}}^{+}}_{t}+\alpha\Bigg(\frac{\big(\Delta\mathbf{q}-\widehat{\mathbf{J}_{\mathbf{s}}^{+}}_{t}\Delta\mathbf{e}\big)\Delta\mathbf{q}^{\intercal}\widehat{\mathbf{J}_{\mathbf{s}}^{+}}_{t}}{\Delta\mathbf{q}^{\intercal}\widehat{\mathbf{J}_{\mathbf{s}}^{+}}_{t}\Delta\mathbf{e}}\Bigg)\circ\mathbf{H},\tag{11}$$

$$\mathbf{e}_{x,y}:=\mathbf{s}_{x,y}\big(M(I,\mathbf{W})\big)-\mathbf{s}^{*}=[s_{x},\,s_{y}]^{\intercal}-\mathbf{s}^{*}.\tag{12}$$

$$\mathbf{J}_{\mathbf{s}}^{+}\approx\frac{\partial\mathbf{q}}{\partial\mathbf{e}_{x,y}}=\frac{\partial\mathbf{q}}{\partial\mathbf{s}_{x,y}}=\begin{bmatrix}\frac{\partial q_{\text{head tilt}}}{\partial s_{x}}&\frac{\partial q_{\text{head tilt}}}{\partial s_{y}}\\\frac{\partial q_{\text{head pan}}}{\partial s_{x}}&\frac{\partial q_{\text{head pan}}}{\partial s_{y}}\\\vdots&\vdots\\\frac{\partial q_{\text{base roll}}}{\partial s_{x}}&\frac{\partial q_{\text{base roll}}}{\partial s_{y}}\end{bmatrix},\tag{13}$$

$$\ell_{1}d_{1}=\ell_{2}d_{2}\implies\frac{\ell_{2}}{\ell_{1}}=\frac{d_{1}}{d_{2}},\tag{14}$$

$$A_{2}=A_{1}\bigg(\frac{\ell_{2}}{\ell_{1}}\bigg)^{2}\implies A_{2}=A_{1}\bigg(\frac{d_{1}}{d_{2}}\bigg)^{2},\tag{15}$$

$$d_{1}\sqrt{A_{1}}=d_{2}\sqrt{A_{2}}=c_{\text{object}},\tag{16}$$

$$d=z_{\text{camera}}-z_{\text{object}},\tag{17}$$

$$(z_{\text{camera},1}-z_{\text{object}})\sqrt{s_{A,1}}=(z_{\text{camera},2}-z_{\text{object}})\sqrt{s_{A,2}}=c_{\text{object}},\tag{18}$$

$$z_{\text{object}}\sqrt{s_{A,1}}+c_{\text{object}}=z_{\text{camera},1}\sqrt{s_{A,1}},\tag{19}$$

$$\begin{bmatrix}\sqrt{s_{A,1}}&1\\\sqrt{s_{A,2}}&1\\\vdots&\vdots\\\sqrt{s_{A,m}}&1\end{bmatrix}\begin{bmatrix}\hat{z}_{\text{object}}\\\hat{c}_{\text{object}}\end{bmatrix}=\begin{bmatrix}z_{\text{camera},1}\sqrt{s_{A,1}}\\z_{\text{camera},2}\sqrt{s_{A,2}}\\\vdots\\z_{\text{camera},m}\sqrt{s_{A,m}}\end{bmatrix}.\tag{20}$$

$$z_{\text{camera, grasp}}=\hat{z}_{\text{object}}+z_{\text{gripper}},\tag{21}$$

$$\underset{q_{\text{wrist roll}}}{\arg\min}\;\mathcal{J}(q_{\text{wrist roll}})=\frac{M\cap M_{\text{grasp}}(q_{\text{wrist roll}})}{M\cup M_{\text{grasp}}(q_{\text{wrist roll}})},\tag{22}$$

$$s_{A,\text{raised}}>0.5\,s_{A,\text{grasp}},\tag{23}$$


Full text

Video Object Segmentation-based Visual Servo Control and Object Depth Estimation on a Mobile Robot

Brent A. Griffin Victoria Florence Jason J. Corso

University of Michigan

{griffb,vflorenc,jjcorso}@umich.edu

Abstract

To be useful in everyday environments, robots must be able to identify and locate real-world objects. In recent years, video object segmentation has made significant progress on densely separating such objects from background in real and challenging videos. Building off of this progress, this paper addresses the problem of identifying generic objects and locating them in 3D using a mobile robot with an RGB camera. We achieve this by, first, introducing a video object segmentation-based approach to visual servo control and active perception and, second, developing a new Hadamard-Broyden update formulation. Our segmentation-based methods are simple but effective, and our update formulation lets a robot quickly learn the relationship between actuators and visual features without any camera calibration. We validate our approach in experiments by learning a variety of actuator-camera configurations on a mobile HSR robot, which subsequently identifies, locates, and grasps objects from the YCB dataset and tracks people and other dynamic articulated objects in real-time.

1 Introduction

Visual servo control (VS), using visual data in the servo loop to control a robot, is a well-established field [11, 28]. Using features from RGB images, VS has been used for positioning UAVs [26, 43] and wheeled robots [35, 42], manipulating objects [29, 54], and even laparoscopic surgery [56]. While this prior work attests to the applicability of VS, generating robust visual features for VS in unstructured environments with generic objects (e.g., without fiducial markers) remains an open problem.

On the other hand, video object segmentation (VOS), the dense separation of objects in video from background, has made recent progress on real, unstructured videos. This progress is due in part to the introduction of multiple benchmark datasets [47, 49, 57], which evaluate VOS methods across many challenging categories, including moving cameras, occlusions, objects leaving view, scale variation, appearance change, edge ambiguity, multiple interacting objects, and dynamic background; these challenges frequently occur simultaneously. However, despite all of VOS’s contributions to video understanding, we are unaware of any work that utilizes VOS for control.

To this end, this paper develops a VOS-based framework to address the problem of visual servo control in unstructured environments. We also use VOS to estimate depth without a 3D sensor (e.g., an RGBD camera in Figure 1 and [20, 53]). Developing VOS-based features for control and depth estimation has many advantages. First, VOS methods are robust across a variety of unstructured objects and backgrounds, making our framework general to many settings. Second, many VOS methods operate on streaming images, making them ideal for tracking objects from a moving robot. Third, ongoing work in active and interactive perception enables robots to automatically generate object-specific training data for VOS methods [6, 21, 32, 41]. Finally, VOS remains a hotly studied area of video understanding, and future improvements in the accuracy and robustness of state-of-the-art segmentation methods will similarly improve our method.

The primary contribution of our paper is the development and experimental evaluation of video object segmentation-based visual servo control (VOS-VS). We demonstrate the utility of VOS-VS on a mobile robot equipped with an RGB camera to identify and position itself relative to many challenging objects from HSR challenges and the YCB object dataset [9]. To the best of our knowledge, this work is the first use of video object segmentation for control.

A second contribution is our new Hadamard-Broyden update formulation, which outperforms the original Broyden update in experiments and enables a robot to learn the relationship between actuators and VOS-VS features online without any camera calibration. Using our update, our robot learns to servo with seven unique configurations across seven actuators and two cameras. To the best of our knowledge, this work is the first use of a Broyden update to directly estimate the pseudoinverse feature Jacobian for visual servo control on a robot.

A final contribution is introducing two more VOS-based methods, VOS-DE and VOS-Grasp. VOS-DE combines segmentation features with Galileo’s Square-cube law and active perception to estimate an object’s depth, which, with VOS-VS, provides an object’s 3D location. VOS-Grasp uses segmentation features for grasping and grasp-error detection. Thus, using our approach, robots can find and grasp objects using a single RGB camera (see Figure 2).

We provide source code and annotated YCB object training data at https://github.com/griffbr/VOSVS.

2 Related Work

2.1 Video Object Segmentation

Video object segmentation methods can be categorized as unsupervised, which usually rely on object motion [19, 24, 33, 46, 55], or semi-supervised, which segment objects specified in user-annotated examples [5, 13, 23, 36, 45, 60]. Of particular interest to the current work, semi-supervised methods learn the visual characteristics of a target object, which enables them to reliably segment dynamic or static objects. To generate our VOS-based features, we segment objects using One-Shot Video Object Segmentation (OSVOS) [8], which is state-of-the-art in VOS and has influenced other leading semi-supervised methods [40, 51].

2.2 Visual Servo Control

In addition to the visual servo literature cited in Section 1, this paper builds off of other methods for control design and feature selection. For control design, a technique using a hybrid input of 3D Cartesian space and 2D image space is developed in [39], with depth estimation provided externally. As a step toward more natural image features, Canny edge detection-based planar contours of objects are used in [14]. When designing features, work in [38] shows that $z$ -axis features should scale proportional to the optical depth of observed targets. Finally, work in [15] controls $z$ -axis motions using the longest line connecting two feature points for rotation and the square root of the collective feature-point-polygon area for depth; this approach addresses the Chaumette Conundrum presented in [10] but also requires that all feature points remain in the image. Notably, early VS methods require structured visual features (e.g., fiducial markers), while recent learning-based methods require manipulators with a fixed workspace [1, 30, 62].

Taking advantage of recent progress in computer vision, this paper introduces robust segmentation-based image features for visual servoing that are generated from ordinary, real-world objects. Furthermore, our features are rotation invariant, work when parts of an object are out of view or occluded, and do not require any particular object viewpoint or marking, making this work applicable to articulated and deformable objects (e.g., the yellow chain in Figures 1-2). Finally, our method enables visual servo control on a mobile manipulation platform, on which we also use segmentation-based features for depth estimation and grasping.

2.3 Active Perception

A critical asset for robot perception is taking actions to improve sensing and understanding of the environment, i.e., Active Perception (AP) [3, 4]. Compared to structure from motion [2, 31, 34], which requires feature matching or scene flow to relate images, AP exploits knowledge of a robot’s relative position to relate images and improve 3D reconstruction. Furthermore, AP methods select new view locations explicitly to improve perception performance [17, 50, 61]. In this work, we use active perception with VOS-based features to estimate an object’s depth. We complete our estimate during our robot’s approach to an object, and, by tracking the estimate’s convergence, we can collect more data if necessary. Essentially, by using an RGB camera and kinematic information that is already available, we estimate the 3D position of objects without any 3D sensors, including: LIDAR, which is cost prohibitive and color blind; RGBD cameras, which do not work in ambient sunlight among other conditions (see Figure 1); and stereo cameras, which require calibration and feature matching. Even when 3D sensors are available, RGB-based methods provide an indispensable backup for perception [44, 52].

3 Robot Model and Perception Hardware

For our robot experiments, we use a Toyota Human Support Robot (HSR), which has a 4-DOF manipulator arm mounted on a torso with prismatic and revolute joints and a differential drive base [58, 59]. Using the revolute joint atop its differential drive base, we effectively control HSR as an omnidirectional robot. For visual servo control, we use the actuators shown in Figure 3 as the joint space $\mathbf{q}\in\mathbb{R}^{10}$ ,

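$$\mathbf{q}=\left[q_{\text{head tilt}},\,q_{\text{head pan}},\,\dots,\,q_{\text{base roll}}\right]^{\intercal}.\tag{1}$$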

In addition to $\mathbf{q}$ , HSR’s end effector has a parallel gripper with series elastic fingertips for grasping objects; the fingertips have 135 mm maximum width.

For perception, we use HSR’s base-mounted UST-20LX 2D scanning laser for obstacle avoidance and the head-mounted Xtion PRO LIVE RGBD camera and end effector-mounted wide-angle grasp camera for segmentation. The head tilt and pan joints act as a 2-DOF gimbal for the head camera, and the grasp camera moves with the arm and wrist joints; both cameras stream 640 $\times$ 480 RGB images.

A significant component of HSR’s manipulation DOF comes from its mobile base. While many planning algorithms work well on high DOF arms with a stationary base, the odometer errors of HSR compound during trajectory execution and cause missed grasps. Thus, VS is well-suited for HSR and other mobile robots, providing visual feedback on an object’s relative position during mobile manipulation.

4 Segmentation-based Visual Servo Control

4.1 Segmentation-based Features

Assume we are given an RGB image $I$ containing an object of interest. Using VOS, we generate a binary mask

$$M=\text{vos}(I,\mathbf{W}),\tag{2}$$

where $M$ consists of pixel-level labels $\ell_{p}\in\{0,1\}$ , $\ell_{p}=1$ indicates pixel $p$ corresponds to the segmented object, and $\mathbf{W}$ are learned VOS parameters (details in Section 7.2).

Using $M$ , we define the following VOS-based features

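$$s_{A}:=\sum_{\ell_{p}\in M}\ell_{p},\tag{3}$$

$$s_{x}:=\frac{\sum_{\ell_{p}\in M}\ell_{p}\,p_{x}}{s_{A}},\tag{4}$$

$$s_{y}:=\frac{\sum_{\ell_{p}\in M}\ell_{p}\,p_{y}}{s_{A}},\tag{5}$$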

where $s_{A}$ is a measure of segmentation area by the number of labeled pixels, $s_{x}$ is the $x$-centroid of the segmented object using $x$-axis label positions $p_{x}$, and $s_{y}$ is the equivalent $y$-centroid. In addition to (3)-(5), we introduce more VOS features for depth estimation and grasping in Sections 5-6.
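As a minimal sketch of how the area and centroid features could be computed from a binary mask, the NumPy snippet below counts labeled pixels and averages their positions. The function name `vos_features` and the convention that $x$ is the column index and $y$ is the row index are assumptions for illustration, not taken from the paper's released code.

```python
import numpy as np

def vos_features(mask):
    """Return (s_A, s_x, s_y) for a binary mask of shape (height, width).

    ASSUMPTION: x is the column index and y is the row index.
    """
    mask = np.asarray(mask, dtype=bool)
    s_area = int(mask.sum())              # s_A: number of labeled pixels
    if s_area == 0:
        return 0, None, None              # centroid is undefined without labels
    rows, cols = np.nonzero(mask)         # positions of the labeled pixels
    s_x = float(cols.mean())              # x-centroid
    s_y = float(rows.mean())              # y-centroid
    return s_area, s_x, s_y

# Synthetic 640 x 480 mask with a 100 x 100 pixel square "object".
mask = np.zeros((480, 640), dtype=bool)
mask[100:200, 300:400] = True
features = vos_features(mask)
```

Returning `None` for the centroid of an empty mask is one possible design choice; a controller built on these features would need some explicit handling for frames in which the object is not segmented.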

4.2 Visual Servo Control

Using VOS-based features for our visual servo control scheme, we define image feature error

$$\mathbf{e}:=\mathbf{s}(I,\mathbf{W})-\mathbf{s}^{*},\tag{6}$$

where $\mathbf{s}\in\mathbb{R}^{k}$ is the vector of visual features found in image $I$ using learned VOS parameters $\mathbf{W}$ and $\mathbf{s}^{*}\in\mathbb{R}^{k}$ is the vector of desired feature values. In contrast to many VS control schemes, $\mathbf{e}$ in (6) has no dependence on time, previous observations, or additional system parameters (e.g., camera parameters or 3D object models).

Typical VS approaches relate camera motion to $\mathbf{s}$ using

$$\dot{\mathbf{s}}=\mathbf{L_{s}}\mathbf{v_{c}},\tag{7}$$

where $\mathbf{L_{s}}\in\mathbb{R}^{k\times 6}$ is a feature Jacobian relating the three linear and three angular camera velocities $\mathbf{v_{c}}\in\mathbb{R}^{6}$ to $\dot{\mathbf{s}}$ . From (6)-(7), assuming ${\mathbf{\dot{s}}^{*}}=0\implies\dot{\mathbf{e}}=\dot{\mathbf{s}}=\mathbf{L_{s}}\mathbf{v_{c}}$ , we find the VS control velocities $\mathbf{v_{c}}$ to minimize $\mathbf{e}$ as

$$\mathbf{v_{c}}=-\lambda\widehat{\mathbf{L}_{\mathbf{s}}^{+}}\mathbf{e},\tag{8}$$

where $\widehat{\mathbf{L}_{\mathbf{s}}^{+}}$ is the estimated pseudoinverse of $\mathbf{L_{s}}$ and $\lambda$ ensures an exponential decoupled decrease of $\mathbf{e}$ [11]. Notably, VS control using (8) requires continuous, six degree of freedom (DOF) control of camera velocity.
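The role of $\lambda$ can be made explicit with a short worked step. The sketch below assumes the standard control law $\mathbf{v_{c}}=-\lambda\widehat{\mathbf{L}_{\mathbf{s}}^{+}}\mathbf{e}$ from [11] and the idealization $\mathbf{L_{s}}\widehat{\mathbf{L}_{\mathbf{s}}^{+}}\approx\mathbf{I}$ (an accurate estimate with full row rank), which is an assumption and will only hold approximately on a real system.

```latex
\begin{align}
\dot{\mathbf{e}} &= \mathbf{L_{s}}\mathbf{v_{c}}
  = -\lambda\,\mathbf{L_{s}}\widehat{\mathbf{L}_{\mathbf{s}}^{+}}\,\mathbf{e}, \\
\mathbf{L_{s}}\widehat{\mathbf{L}_{\mathbf{s}}^{+}} \approx \mathbf{I}
  &\implies \dot{\mathbf{e}} \approx -\lambda\,\mathbf{e}
  \implies \mathbf{e}(t) \approx \mathbf{e}(0)\,e^{-\lambda t}.
\end{align}
```

Under these assumptions each error component decays independently at rate $\lambda$, which is the sense in which the decrease is exponential and decoupled.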

To make (8) more general for discrete motion planning and fewer required control inputs, we modify (7)-(8) to

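$$\Delta\mathbf{s}=\mathbf{J_{s}}\Delta\mathbf{q},\tag{9}$$

$$\Delta\mathbf{q}=-\lambda\widehat{\mathbf{J}_{\mathbf{s}}^{+}}\mathbf{e},\tag{10}$$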

where $\Delta\mathbf{q}$ is the change of $\mathbf{q}\in\mathbb{R}^{n}$ actuated joints, $\mathbf{J_{s}}\in\mathbb{R}^{k\times n}$ is the feature Jacobian relating $\Delta\mathbf{q}$ to $\Delta\mathbf{s}$ , and $\widehat{\mathbf{J}_{\mathbf{s}}^{+}}$ is the estimated pseudoinverse of $\mathbf{J_{s}}$ . We command $\Delta\mathbf{q}$ directly to the robot joint space as our VOS-VS controller to minimize $\mathbf{e}$ and reach the desired feature values $\mathbf{s}^{*}$ in (6).
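A single step of the discrete controller can be sketched as follows, assuming the joint command takes the form $\Delta\mathbf{q}=-\lambda\widehat{\mathbf{J}_{\mathbf{s}}^{+}}\mathbf{e}$. The function name `vos_vs_step`, the gain `lam=1.0`, and the two-joint example values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def vos_vs_step(J_pinv_hat, s, s_star, lam=1.0):
    """One discrete VOS-VS step: return the joint command dq and the error e.

    J_pinv_hat: (n, k) estimated pseudoinverse feature Jacobian.
    s, s_star:  length-k current and desired feature vectors.
    lam:        controller gain (hypothetical default, not from the paper).
    """
    e = np.asarray(s, dtype=float) - np.asarray(s_star, dtype=float)
    dq = -lam * np.asarray(J_pinv_hat, dtype=float) @ e
    return dq, e

# Two joints, two features: the object is 100 pixels right of the image center.
J_pinv_hat = np.array([[0.002, 0.0],
                       [0.0, 0.002]])
dq, e = vos_vs_step(J_pinv_hat, s=[420.0, 240.0], s_star=[320.0, 240.0])
```

In this example only the first joint moves, because the error is purely along $s_{x}$ and the off-diagonal entries of the estimate are zero.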

4.3 Hadamard-Broyden Update Formulation

In real visual servo systems, it is impossible to know the exact feature Jacobian ( $\mathbf{J_{s}}$ ) relating control actuators to image features [11]. Instead, some VS methods estimate $\mathbf{J_{s}}$ directly from observations [12]; among these, a few use the Broyden update rule [27, 29, 48], which iteratively updates online. In contrast to previous VS work, Broyden’s original paper provides a formulation to estimate the pseudoinverse feature Jacobian ( $\widehat{\mathbf{J}_{\mathbf{s}}^{+}}$ ) [7, (4.5)]. However, we found it necessary to augment Broyden’s formulation with a logical matrix $\mathbf{H}$ , and define our new Hadamard-Broyden update

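$$\widehat{\mathbf{J}_{\mathbf{s}}^{+}}_{t+1}:=\widehat{\mathbf{J}_{\mathbf{s}}^{+}}_{t}+\alpha\Bigg(\frac{\big(\Delta\mathbf{q}-\widehat{\mathbf{J}_{\mathbf{s}}^{+}}_{t}\Delta\mathbf{e}\big)\Delta\mathbf{q}^{\intercal}\widehat{\mathbf{J}_{\mathbf{s}}^{+}}_{t}}{\Delta\mathbf{q}^{\intercal}\widehat{\mathbf{J}_{\mathbf{s}}^{+}}_{t}\Delta\mathbf{e}}\Bigg)\circ\mathbf{H},\tag{11}$$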

where $\alpha$ determines the update speed, $\Delta\mathbf{q}=\mathbf{q}_{t}-\mathbf{q}_{t-1}$ and $\Delta\mathbf{e}=\mathbf{e}_{t}-\mathbf{e}_{t-1}$ are the changes in joint space and feature errors since the last update, and $\mathbf{H}\in\mathbb{R}^{n\times k}$ is a logical matrix coupling actuators to image features. In experiments, we initialize (11) using $\alpha=0.1$ and $\widehat{\mathbf{J}_{\mathbf{s}}^{+}}_{t=0}=0.001\mathbf{H}$ .

The Hadamard product with $\mathbf{H}$ prevents undesired coupling between certain actuator and image feature pairs. In practice, we find that using the original Broyden update results in unpredictable convergence and learning gains for actuator-image feature pairs that are, in fact, unrelated. Fortunately, we find that using $\mathbf{H}$ in (11) enables real-time convergence without any calibration on the robot for all of the experiment configurations in Section 7.3.
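The update can be sketched in a few lines of NumPy. This is a minimal illustration of the Hadamard-Broyden rule as described above (a Broyden-style rank-one correction to the pseudoinverse estimate, masked elementwise by $\mathbf{H}$), not the authors' implementation. The function name and the `eps` guard that skips near-singular updates are assumptions added here.

```python
import numpy as np

def hadamard_broyden_update(J_pinv_hat, dq, de, H, alpha=0.1, eps=1e-12):
    """Return the updated (n, k) pseudoinverse Jacobian estimate.

    dq: length-n change in joints since the last update.
    de: length-k change in feature error since the last update.
    H:  (n, k) logical coupling matrix; entries with H == 0 are never updated.
    """
    J = np.asarray(J_pinv_hat, dtype=float)
    dq = np.asarray(dq, dtype=float).reshape(-1, 1)    # n x 1
    de = np.asarray(de, dtype=float).reshape(-1, 1)    # k x 1
    denom = (dq.T @ J @ de).item()
    if abs(denom) < eps:
        return J.copy()   # ASSUMPTION: skip degenerate updates (not from the paper)
    broyden = (dq - J @ de) @ (dq.T @ J) / denom       # n x k rank-one correction
    return J + alpha * broyden * np.asarray(H, dtype=float)

# Two joints, two features, diagonal coupling, initialized as 0.001 * H.
H = np.eye(2)
J0 = 0.001 * H
# Joint 0 moved 0.1 and feature 0 changed by 50 pixels (observed ratio 0.002).
J1 = hadamard_broyden_update(J0, dq=[0.1, 0.0], de=[50.0, 0.0], H=H)
```

With $\alpha=0.1$ the coupled entry moves one tenth of the way from 0.001 toward the observed ratio 0.002, while entries masked out by $\mathbf{H}$ stay at zero.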

4.4 VOS-VS Configurations

We learn seven unique VOS-VS configurations using our HB update. Using $s_{x}$ (4) and $s_{y}$ (5) in $\mathbf{e}$ (6), we define error

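$$\mathbf{e}_{x,y}:=\mathbf{s}_{x,y}\big(M(I,\mathbf{W})\big)-\mathbf{s}^{*}=[s_{x},\,s_{y}]^{\intercal}-\mathbf{s}^{*}.\tag{12}$$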

Using $\mathbf{e}_{x,y}$ and HSR joints $\mathbf{q}$ (1), we choose $\widehat{\mathbf{J}_{\mathbf{s}}^{+}}$ in (11) as

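$$\mathbf{J}_{\mathbf{s}}^{+}\approx\frac{\partial\mathbf{q}}{\partial\mathbf{e}_{x,y}}=\frac{\partial\mathbf{q}}{\partial\mathbf{s}_{x,y}}=\begin{bmatrix}\frac{\partial q_{\text{head tilt}}}{\partial s_{x}}&\frac{\partial q_{\text{head tilt}}}{\partial s_{y}}\\\frac{\partial q_{\text{head pan}}}{\partial s_{x}}&\frac{\partial q_{\text{head pan}}}{\partial s_{y}}\\\vdots&\vdots\\\frac{\partial q_{\text{base roll}}}{\partial s_{x}}&\frac{\partial q_{\text{base roll}}}{\partial s_{y}}\end{bmatrix},\tag{13}$$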

where $\widehat{\mathbf{J}_{\mathbf{s}}^{+}}\in\mathbb{R}^{10\times 2}$ . Note that in our Hadamard-Broyden update (11), each element $\frac{\partial q_{i}}{\partial s_{j}}$ in $\widehat{\mathbf{J}_{\mathbf{s}}^{+}}$ is multiplied by element $\mathbf{H}_{i,j}$ in the Hadamard product. Thus, we configure the logical coupling matrix $\mathbf{H}$ by setting $\mathbf{H}_{i,j}=1$ if coupling actuated joint $q_{i}$ with image feature $s_{j}$ is desired. Using our update formulation (11), we learn $\widehat{\mathbf{J}_{\mathbf{s}}^{+}}$ on HSR for the seven $\mathbf{H}$ configurations listed in Table 1 and provide experimental results for each configuration in Section 7.3.
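To make the role of $\mathbf{H}$ concrete, the sketch below builds the logical coupling matrix from a list of (joint, feature) pairs. The joint ordering in `JOINTS` is hypothetical: the paper specifies only the first two and the last entries of $\mathbf{q}$, so the middle entries and the helper name `coupling_matrix` are assumptions for illustration. The two example configurations follow the pairings listed for $\mathbf{H}_{\text{head}}$ and $\mathbf{H}_{\text{base}}$ in Table 1.

```python
import numpy as np

# ASSUMPTION: illustrative ordering of the 10 joints; only head tilt, head pan,
# and base roll positions are stated in the text.
JOINTS = [
    "head_tilt", "head_pan", "arm_lift", "arm_flex", "arm_roll",
    "wrist_flex", "wrist_roll", "base_forward", "base_lateral", "base_roll",
]
FEATURES = ["s_x", "s_y"]

def coupling_matrix(pairs):
    """Build the n x k logical matrix H with H[i, j] = 1 for each coupled pair."""
    H = np.zeros((len(JOINTS), len(FEATURES)))
    for joint, feature in pairs:
        H[JOINTS.index(joint), FEATURES.index(feature)] = 1.0
    return H

H_head = coupling_matrix([("head_pan", "s_x"), ("head_tilt", "s_y")])
H_base = coupling_matrix([("base_forward", "s_x"), ("base_lateral", "s_y")])

# Initial estimate used in the experiments: 0.001 * H.
J_pinv_init = 0.001 * H_head
```

Every entry outside the listed pairs is zero, so the corresponding joints receive no command from the controller and no update from the learning rule.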

5 Segmentation-based Depth Estimation

By combining VOS-based features with active perception, we are able to estimate the depth of segmented objects and approximate their 3D position. As shown in Figure 4, we initiate our depth estimation framework (VOS-DE) by centering the optical axis of our camera with a segmented object using the $\mathbf{H}_{\text{base}}$ VOS-VS controller. This alignment minimizes lens distortion, which facilitates the use of an ideal camera model. Using the pinhole camera model [22], projections of objects onto the image plane scale inversely with their distance on the optical axis from the camera.

Thus, with the object centered on the optical axis, we can relate projection scale and object distance using

$$\ell_{1}d_{1}=\ell_{2}d_{2}\implies\frac{\ell_{2}}{\ell_{1}}=\frac{d_{1}}{d_{2}},\tag{14}$$

where $\ell_{1}$ is the projected length of an object measurement orthogonal to the optical axis, $d_{1}$ is the distance along the optical axis of the object away from the camera, and $\ell_{2}$ is the projected measurement length at a new distance $d_{2}$ . Combining Galileo Galilei’s Square-cube law with (14),

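$$A_{2}=A_{1}\bigg(\frac{\ell_{2}}{\ell_{1}}\bigg)^{2}\implies A_{2}=A_{1}\bigg(\frac{d_{1}}{d_{2}}\bigg)^{2},\tag{15}$$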

where $A_{1}$ is the projected object area corresponding to $\ell_{1}$ and $d_{1}$ (see Figure 4). As the camera advances on the optical axis, we modify (15) to relate collected images using

$$d_{1}\sqrt{A_{1}}=d_{2}\sqrt{A_{2}}=c_{\text{object}},\tag{16}$$

where $c_{\text{object}}$ is a constant proportional to the orthogonal surface area of the segmented object. Also, using a coordinate frame with the $z$ axis aligned with the optical axis,

$$d=z_{\text{camera}}-z_{\text{object}},\tag{17}$$

where $z_{\text{camera}}$ and $z_{\text{object}}$ are the $z$ -axis coordinates of the camera and object. Because the camera and object are both centered on the $z$ axis, $x_{\text{camera}}=x_{\text{object}}=0$ and $y_{\text{camera}}=y_{\text{object}}=0$ . Using (17) and $s_{A}$ (3), we update (16) as

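$$(z_{\text{camera},1}-z_{\text{object}})\sqrt{s_{A,1}}=(z_{\text{camera},2}-z_{\text{object}})\sqrt{s_{A,2}}=c_{\text{object}},\tag{18}$$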

where the object is assumed stationary between images (i.e., $\dot{z}_{\text{object}}=0$) and the $z_{\text{camera}}$ position is known from the robot’s kinematics. Note that $z_{\text{camera}}$ provides relative depth for VOS-DE and (18) identifies a key linear relationship between $\sqrt{s_{A}}$ and the distance between the object and camera.
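The step from the area scaling law to the constant-product relation can be written out as follows. This is a worked sketch that assumes positive distances $d_{1},d_{2}>0$ and identifies the projected area $A_{i}$ with the segmentation area $s_{A,i}$ of image $i$.

```latex
\begin{align}
A_{2} = A_{1}\left(\frac{d_{1}}{d_{2}}\right)^{2}
  &\implies d_{2}^{2}A_{2} = d_{1}^{2}A_{1} \\
  &\implies d_{2}\sqrt{A_{2}} = d_{1}\sqrt{A_{1}} = c_{\text{object}}, \\
d_{i} = z_{\text{camera},i}-z_{\text{object}},\quad A_{i} = s_{A,i}
  &\implies (z_{\text{camera},i}-z_{\text{object}})\sqrt{s_{A,i}} = c_{\text{object}}.
\end{align}
```

Expanding the last line gives $z_{\text{object}}\sqrt{s_{A,i}}+c_{\text{object}}=z_{\text{camera},i}\sqrt{s_{A,i}}$, which is linear in the two unknowns $z_{\text{object}}$ and $c_{\text{object}}$.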

Finally, after collecting a series of $m$ measurements, we estimate the depth of the segmented object. From (18),

$$z_{\text{object}}\sqrt{s_{A,1}}+c_{\text{object}}=z_{\text{camera},1}\sqrt{s_{A,1}},\tag{19}$$

which over the $m$ measurements in $\mathbf{A}\mathbf{x}=\mathbf{b}$ form yields

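$$\begin{bmatrix}\sqrt{s_{A,1}}&1\\\sqrt{s_{A,2}}&1\\\vdots&\vdots\\\sqrt{s_{A,m}}&1\end{bmatrix}\begin{bmatrix}\hat{z}_{\text{object}}\\\hat{c}_{\text{object}}\end{bmatrix}=\begin{bmatrix}z_{\text{camera},1}\sqrt{s_{A,1}}\\z_{\text{camera},2}\sqrt{s_{A,2}}\\\vdots\\z_{\text{camera},m}\sqrt{s_{A,m}}\end{bmatrix}.\tag{20}$$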

By solving (20) for $\hat{z}_{\text{object}}$ and $\hat{c}_{\text{object}}$ , we estimate the distance $d$ in (17), and, thus, the 3D location of the object. In Section 7.4, we show that our combined VOS-VS and VOS-DE framework is sufficient for locating, approaching, and estimating the depth of a variety of unstructured objects.

Remark: There are many methods to find approximate solutions to (20). In practice, we find that a least squares solution provides robustness to outliers caused by segmentation errors (see visual and quantitative example in Figures 5-6).
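A least squares solve of the stacked system can be sketched with `numpy.linalg.lstsq`. The snippet assumes each measurement row has the form $[\sqrt{s_{A,i}},\;1]$ with right-hand side $z_{\text{camera},i}\sqrt{s_{A,i}}$, consistent with the linear relationship in $\sqrt{s_{A}}$ noted above. The function name `estimate_depth` and the synthetic measurements (generated from a made-up object at $z=0.1$ with constant 30) are illustrative assumptions, not data from the experiments.

```python
import numpy as np

def estimate_depth(z_camera, s_area):
    """Least squares estimate of (z_object, c_object) from m measurements.

    z_camera: camera z positions from robot kinematics, length m.
    s_area:   segmentation areas s_A at those positions, length m.
    """
    z = np.asarray(z_camera, dtype=float)
    root = np.sqrt(np.asarray(s_area, dtype=float))
    A = np.column_stack([root, np.ones_like(root)])   # rows [sqrt(s_A,i), 1]
    b = z * root                                      # z_camera,i * sqrt(s_A,i)
    (z_obj, c_obj), *_ = np.linalg.lstsq(A, b, rcond=None)
    return z_obj, c_obj

# Synthetic, noise-free measurements: sqrt(s_A) = 30 / (z_camera - 0.1).
z_camera = [0.7, 0.6, 0.5, 0.4]
s_area = [2500.0, 3600.0, 5625.0, 10000.0]
z_hat, c_hat = estimate_depth(z_camera, s_area)
```

With at least two distinct camera positions the system is determined; additional measurements collected during the approach let the solve average over per-frame segmentation errors.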

6 Segmentation-based Grasping

We develop a VOS-based method of grasping and grasp-error detection (VOS-Grasp). Assuming an object is centered and has estimated depth $\hat{z}_{\text{object}}$ , we move $z_{\text{camera}}$ to

$$z_{\text{camera, grasp}}=\hat{z}_{\text{object}}+z_{\text{gripper}},\tag{21}$$

where $z_{\text{gripper}}$ is the known $z$ -axis offset between $z_{\text{camera}}$ and the center of HSR’s closed fingertips. Thus, when $z_{\text{camera}}$ is at $z_{\text{camera, grasp}}$ , HSR can reach the object at depth $\hat{z}_{\text{object}}$ .

After moving to $z_{\text{camera, grasp}}$ , we center the object directly underneath HSR’s antipodal gripper using $\mathbf{H}_{\text{base grasp}}$ VOS-VS control. To find a suitable grasp location, we project and rotate a mask of the gripper, $M_{\text{grasp}}$ , into the camera as shown in column 5 of Figure 5 and solve

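$$\underset{q_{\text{wrist roll}}}{\arg\min}\;\mathcal{J}(q_{\text{wrist roll}})=\frac{M\cap M_{\text{grasp}}(q_{\text{wrist roll}})}{M\cup M_{\text{grasp}}(q_{\text{wrist roll}})},\tag{22}$$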

where $\mathcal{J}$ is the intersection over union (or Jaccard index [18]) of $M_{\text{grasp}}$ and object segmentation mask $M$ , and $M_{\text{grasp}}(q_{\text{wrist roll}})$ is the projection of $M_{\text{grasp}}$ corresponding to HSR wrist rotation $q_{\text{wrist roll}}$ . Thus, we grasp the object using the wrist rotation with least intersection between the object and the gripper, which is then less likely to collide with the object before achieving a parallel grasp.
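The wrist rotation search can be sketched as a minimum over candidate angles of the Jaccard index between the object mask and a gripper mask. How the gripper mask is projected and rotated into the camera is not reproduced here: the snippet assumes the candidate masks are already available in a dictionary keyed by angle. The function names, the angle keys, and the toy 9 x 9 masks are all hypothetical.

```python
import numpy as np

def jaccard(mask_a, mask_b):
    """Intersection over union of two binary masks (0.0 if both are empty)."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def best_wrist_roll(object_mask, gripper_masks):
    """Return the candidate angle whose gripper mask overlaps the object least.

    gripper_masks: dict mapping a wrist roll angle to its projected gripper mask.
    """
    return min(gripper_masks, key=lambda angle: jaccard(object_mask, gripper_masks[angle]))

# Toy example: a horizontal bar and two candidate finger placements.
obj = np.zeros((9, 9), dtype=bool)
obj[4, 1:8] = True                       # object lies along row 4
g0 = np.zeros((9, 9), dtype=bool)
g0[3:6, 1] = g0[3:6, 7] = True           # fingers at the bar's ends (overlap)
g90 = np.zeros((9, 9), dtype=bool)
g90[1, 3:6] = g90[7, 3:6] = True         # fingers above and below (no overlap)
best = best_wrist_roll(obj, {0: g0, 90: g90})
```

In the toy case the fingers at 90 degrees straddle the bar without touching it, so that angle has zero overlap and is selected.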

After the object is grasped, we lift HSR’s arm to perform a visual grasp check. We consider a grasp complete if

$$s_{A,\text{raised}}>0.5\,s_{A,\text{grasp}},\tag{23}$$

where $s_{A,\text{grasp}}$ is the object segmentation size $s_{A}$ (3) during the initial grasp and $s_{A,\text{raised}}$ is the corresponding $s_{A}$ after lifting the arm. If $s_{A}$ decreases when lifting the arm, the object is further from the camera and not securely grasped. Thus, we quickly identify if a grasp is missed and regrasp as necessary. Note that this VOS-based grasp check can also work with other grasping methods [25, 37]. A complete demonstration of our VOS-based visual servo control, depth estimation, and grasping framework is shown in Figure 5.
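The grasp check reduces to a single comparison of segmentation areas. The sketch below assumes the threshold form $s_{A,\text{raised}}>0.5\,s_{A,\text{grasp}}$; the function name and the example pixel counts are hypothetical.

```python
def grasp_succeeded(s_area_grasp, s_area_raised, ratio=0.5):
    """Return True if the object still fills enough of the image after lifting.

    s_area_grasp:  segmentation area s_A at the initial grasp.
    s_area_raised: segmentation area s_A after lifting the arm.
    ratio:         minimum fraction of the grasp-time area that must remain.
    """
    return s_area_raised > ratio * s_area_grasp

held = grasp_succeeded(4000, 3500)      # area mostly preserved after lifting
dropped = grasp_succeeded(4000, 1500)   # area shrank: object left behind
```

A robot loop could call this after each lift and trigger a regrasp whenever it returns `False`.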

7 Robot Experiments

7.1 Experiment Objects

For most of our experiments, we use the objects from the YCB object dataset [9] shown in Figure 7. We use six objects from each of the food, kitchen, tool, and shape categories and purposefully choose some of the most difficult objects. To name only a few of the challenges for the selected objects: dimensions span from the 470 mm long pan to the 4 mm thick washer, most of the contours change with pose, and over a third of the objects exhibit specular reflection of overhead lights. To learn object recognition, we annotate ten training images of each object using HSR’s grasp camera with various object poses, backgrounds, and distances from the camera (see example image in Figure 2).

7.2 Video Object Segmentation Method

We segment objects using OSVOS [8]. OSVOS uses a base network trained on ImageNet [16] to recognize image features, re-trains a parent network on DAVIS [47] to learn general video object segmentation, and then fine tunes for each of our experiment objects (i.e., each object has unique learned parameters $\mathbf{W}$ in (2)). After learning $\mathbf{W}$ , our VOS framework segments HSR’s 640 $\times$ 480 RGB images at 29.6 Hz using a single GPU (GTX 1080 Ti).

7.3 VOS-VS Results

**Hadamard-Broyden Update.** We learn all of the VOS-VS configurations in Table 1 on HSR using the Hadamard-Broyden update formulation in (11). We initialize each configuration using $\widehat{\mathbf{J}_{\mathbf{s}}^{+}}_{t=0}=0.001\,\mathbf{H}$, $\alpha=0.1$, and a target object in view to elicit a step response from the VOS-VS controller (see Figure 8). Each configuration starts at a specific pose (e.g., $\mathbf{H}_{\text{base}}$ uses the leftmost pose in Figures 4-5), and configurations use $s^{*}=[320,240]^{\intercal}$ in (12), except for $\mathbf{H}_{\text{base grasp}}$, which uses $s^{*}=[220,240]^{\intercal}$ to position grasps.

When initializing each configuration, after a few iterations of control inputs from (10) and updates from (11), the learned $\widehat{\mathbf{J}_{\mathbf{s}}^{+}}$ matrix generally shows convergence for any $\mathbf{H}_{i,j}$ component that is initialized with the correct sign (e.g., five updates for $\frac{\partial q_{\text{base lateral}}}{\partial s_{y}}$ in Figure 9). Components initialized with an incorrect sign generally require more updates to change directions and jump through zero during one of the discrete updates (e.g., $\frac{\partial q_{\text{base forward}}}{\partial s_{x}}$ in Figure 9). If an object goes out of view from an incorrectly signed component, we reset HSR’s pose and restart the update from the most recent $\widehat{\mathbf{J}_{\mathbf{s}}^{+}}_{t}$ . Once $s^{*}$ is reached, the object can be moved to elicit a few more step responses for fine tuning. Table 1 shows the learned parameters for each configuration. In the remaining experiments, we set $\alpha=0$ in (11) to reduce variability.
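The sign-flip behavior can be illustrated qualitatively with a toy one-joint, one-feature simulation. Everything here is an assumption for illustration: the feature responds linearly to the joint, the true ratio is $-0.002$, the gain is 1, and the update is written in the scalar form $\hat{J}\leftarrow\hat{J}+\alpha(\Delta q/\Delta e-\hat{J})$ that the Broyden-style rule reduces to for a single coupled pair. The step counts it produces are properties of this toy model and are not expected to match Figure 9.

```python
def simulate(true_dq_ds=-0.002, j0=0.001, e0=100.0, alpha=0.1, lam=1.0, steps=30):
    """Toy 1-D servo loop with online learning; returns [(j_hat, e), ...].

    ASSUMPTION: linear feature response, de = dq / true_dq_ds.
    j0 has the wrong sign on purpose, mimicking a bad initialization.
    """
    j_hat, e = j0, e0
    history = []
    for _ in range(steps):
        dq = -lam * j_hat * e              # joint command from the current estimate
        de = dq / true_dq_ds               # simulated change in feature error
        e += de
        if de != 0.0:
            # Scalar form of the update for one coupled joint-feature pair.
            j_hat += alpha * (dq / de - j_hat)
        history.append((j_hat, e))
    return history

history = simulate()
first_negative = next(i for i, (j, _) in enumerate(history) if j < 0)
peak_error = max(abs(e) for _, e in history)
```

In this toy model the estimate crosses zero on the fourth update, and the error grows to roughly 2.7 times its initial value before shrinking, which mirrors why an object can leave the field of view when a component starts with the wrong sign.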

**$\mathbf{H}_{\text{base}}$ Results.** We show the step response of all $\widehat{\mathbf{J}_{\mathbf{s}}^{+}}$ configurations in Table 1 by performing experiments centering the camera on objects placed at various viewpoints within each configuration’s starting pose. In Figure 10, both $\mathbf{H}_{\text{base}}$ and $\mathbf{H}_{\text{base grasp}}$ exhibit a stable response. Our motivation to learn two base configurations is the increase in $s_{x,y}$ sensitivity to base motion as an object’s depth decreases. $\mathbf{H}_{\text{base}}$ operates with the camera raised high above objects, while $\mathbf{H}_{\text{base grasp}}$ operates with the camera directly above objects to position for grasping. Thus, $\mathbf{H}_{\text{base}}$ requires more movement than $\mathbf{H}_{\text{base grasp}}$ for the same changes in $s_{x,y}$. This difference is apparent in Table 1 from $\mathbf{H}_{\text{base}}$ learning larger-magnitude $\frac{\partial q_{\text{base}}}{\partial s}$ values and in Figure 10 from $\mathbf{H}_{\text{base}}$’s smaller $s_{x,y}$ distribution for identical object distances.

**$\mathbf{H}_{\text{arm}}$ Results.** We show the step response of all arm-based VOS-VS configurations in Figure 11. Each configuration uses the same objects and starting pose. Although each configuration segments the pan and baseball, $s^{*}$ is not reachable for these objects within any of the configured actuator spaces; $\mathbf{H}_{\text{arm wrist}}$ is the only configuration to center on all four of the other objects. The overactuated $\mathbf{H}_{\text{arm both}}$ has the most overshoot, while $\mathbf{H}_{\text{arm lift}}$ has the most limited range of camera positions but essentially deadbeat control.

$\mathbf{H}_{\text{head}}$ Results Finally, we show the step response of $\mathbf{H}_{\text{head}}$ in Figure 12. $\mathbf{H}_{\text{head}}$ is the only configuration that uses HSR’s 2-DOF head gimbal and camera, and it exhibits a smooth step response over the entire image. Remarkably, even though $\mathbf{H}_{\text{head}}$ uses the head camera, it still uses the same OSVOS parameters $\mathbf{W}$ that are learned on grasp camera images; this further demonstrates the general applicability of VOS-VS, which needs no camera calibration.

7.4 Consecutive Mobile Robot Trials

We perform an experiment consisting of a consecutive set of mobile trials that simultaneously test VOS-VS and VOS-DE. Each trial consists of three unique YCB objects placed at different heights: one on the blue bin 0.25 m above the ground, one on the green bin 0.125 m above the ground, and one directly on the ground (see bin configuration in Figure 2). The trial configurations and corresponding results are provided in Table 2. VOS-VS is considered a success (“X”) if HSR locates and centers on the object for depth estimation. VOS-DE is considered a success if HSR achieves $z_{\text{camera, grasp}}$ (21) such that HSR can close its grippers on the object without hitting the underlying surface and $z_{\text{camera}}$ does not move past the top surface of the object.

Across all 24 objects, VOS-VS has an 83% success rate. VOS-DE, which is only applicable when VOS-VS succeeds, has a 50% success rate. By category, food objects have the highest success (100% VOS-VS, 83% VOS-DE) and kitchen objects have the lowest (50% VOS-VS, 66% VOS-DE). Failures are caused by segmentation errors. Although VOS-VS can center on a poorly segmented object, VOS-DE fails if there are erratic changes in segmentation area (we provide examples in the Appendix). Additionally, VOS-DE’s margin for success varies between objects (e.g., the smallest margin is the 4 mm thick washer).
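Because VOS-DE is only attempted after VOS-VS succeeds, the two rates may use different denominators. The sketch below shows one way to compute them, reading the VOS-DE rate as a fraction of VOS-VS successes; that reading is an interpretation of the text rather than something it states outright. The function name and the trial outcomes are synthetic and are not the results in Table 2.

```python
def success_rates(trials):
    """Return (VOS-VS rate over all trials, VOS-DE rate over VOS-VS successes).

    Each trial is a (vs_ok, de_ok) pair of booleans. de_ok is ignored when
    vs_ok is False, since depth estimation is not attempted in that case.
    """
    if not trials:
        raise ValueError("at least one trial is required")
    de_outcomes = [de_ok for vs_ok, de_ok in trials if vs_ok]
    vs_rate = len(de_outcomes) / len(trials)
    de_rate = sum(de_outcomes) / len(de_outcomes) if de_outcomes else float("nan")
    return vs_rate, de_rate


# Synthetic outcomes for illustration only.
example_trials = [(True, True), (True, False), (False, False), (True, True)]
vs_rate, de_rate = success_rates(example_trials)
```

As a point of reference for the reported figure, 20 of 24 is the only whole count of successes that rounds to 83%.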

7.5 Additional Experiments

Pick-and-place Challenges We perform additional experiments for our VOS-based methods, including our work in the TRI-sponsored HSR challenges. These challenges consist of timed trials for pick-and-place tasks with randomly scattered, non-YCB objects (e.g., the banana peel in Figure 13). These challenges are a particularly good demonstration of VOS-VS and VOS-Grasp. We provide additional figures for these experiments in the Appendix.

Dynamic Articulated Objects Finally, we perform additional VOS-VS experiments with dynamic articulated objects. Using $\mathbf{H}_{\text{base}}$, HSR tracks a plastic chain across the room in real time as we kick it and throw it in a variety of unstructured poses; we can even pick up the chain and use it to guide HSR’s movements from the grasp camera. In addition, by training OSVOS to recognize an article of clothing, HSR reliably tracks a person moving throughout the room using $\mathbf{H}_{\text{head}}$ (see Figure 13). Experiment videos are available at: https://youtu.be/hlog5FV9RLs.

8 Conclusions and Future Work

We develop a video object segmentation-based approach to visual servo control, depth estimation, and grasping. Visual servo control is a useful framework for controlling a physical robot system from RGB images, and video object segmentation has seen rapid advances within the computer vision community for densely segmenting unstructured objects in challenging videos. The success of our segmentation-based approach to visual servo control in mobile robot experiments with real-world objects is a tribute to both of these communities and the initiation of a bridge between them. Future developments in video object segmentation will improve the robustness of our method and, we expect, lead to other innovations in robotics.

A significant benefit of our segmentation-based framework is that it only requires an RGB camera combined with robot actuation. For future work, we are improving RGB-based depth estimation and grasping by comparing images collected from more robot poses, thereby leveraging more information and making our 3D understanding of the target object more complete.

Acknowledgment Toyota Research Institute (“TRI”) provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.

Appendix

Segmentation Errors Densely segmenting unstructured objects is a challenging problem, and, despite using state-of-the-art video object segmentation, we have some segmentation errors during our experiments. Figures 14-17 show segmentations used for depth estimation during the consecutive mobile robot trials in Section 7.4. Even with some segmentation errors, VOS-VS centers on all four objects from the high-camera position and VOS-DE successfully estimates the depth of the plate and drill.

Robot’s Perspective when Learning VOS-VS Figure 18 shows the step-to-step visual servo transitions from the robot’s perspective as it is learning $\widehat{\mathbf{J}_{\mathbf{s}}^{+}}$ for $\mathbf{H}_{\text{base}}$ (corresponding to Figures 8-9).

Figures for Pick-and-place Experiments Figure 19 shows a fully-automated pick-and-place task. Figure 20 shows a pick-and-place task with VOS-VS-based human collaboration.
