Support Relation Analysis for Objects in Multiple View RGB-D Images

Peng Zhang; Xiaoyu Ge; Jochen Renz

arXiv:1905.04084·cs.CV·May 13, 2019

TL;DR

This paper introduces a novel method for extracting detailed support relations between objects in multi-view RGB-D images using volumetric representations and qualitative reasoning, enhancing understanding of physical scene structure.

Contribution

The method provides a detailed analysis of support relations based on volumetric models, surpassing simple contact graphs and stability approximations in RGB-D scene understanding.

Findings

1. Successfully applied to warehouse scenes with real-world data

2. Accurately identifies true support relations including side contacts and stability contributions

3. Demonstrates improved physical scene understanding over prior contact-based methods

Tables

Table 1: Some example EIA relations (adding center point to IA)

Relation	Inverse Relation
$EIA(A,B)=lol$	$EIA(B,A)=loli$
$EIA(A,B)=mol$	$EIA(B,A)=moli$
$EIA(A,B)=lom$	$EIA(B,A)=lomi$
$EIA(A,B)=mom$	$EIA(B,A)=momi$
$EIA(A,B)=ms$	$EIA(B,A)=msi$
$EIA(A,B)=ls$	$EIA(B,A)=lsi$
$EIA(A,B)=hd$	$EIA(B,A)=hdi$
$EIA(A,B)=cd$	$EIA(B,A)=cdi$

Table 2: Initial guess results

Initial guess	MSE (Data set 2)
Algorithm 1	1.132e-3
Random initial guess	2.29e-3

Table 3: Support relation results

Method	Accuracy (Data set 1)	Accuracy (Data set 2)
Our Method	72.5	68.2
Agnostic Panda et al. (2016)	65.0	N/A
Aware Panda et al. (2016)	59.5	N/A

Equations

$$x_{c}=\frac{\sum_{i=1}^{n}x_{p_{i}}}{n},\quad y_{c}=\frac{\sum_{i=1}^{n}y_{p_{i}}}{n},\quad z_{c}=\frac{\sum_{i=1}^{n}z_{p_{i}}}{n}$$

$$r=\max_{i\in\{1,...,n\}}dist(c,p_{i})$$

$$|d(t_{1},t_{2})|=|(x_{2}^{+}-x_{1}^{+})+(x_{2}^{-}-x_{1}^{-})|+|(y_{2}^{+}-y_{1}^{+})+(y_{2}^{-}-y_{1}^{-})|$$

$$d_{qt}=d(t_{1},t_{1}^{\prime})$$

$$d_{base}(t_{1},t_{2})=\begin{cases}0&\text{if }\arg\min_{t\in T}(|d(t,t_{2})|)=t_{1}\\1&\text{if }\arg\min_{t\in T}(|d(t,t_{2})|)=t_{\pi/2}\\2&\text{if }\arg\min_{t\in T}(|d(t,t_{2})|)=t_{\pi}\\3&\text{if }\arg\min_{t\in T}(|d(t,t_{2})|)=t_{3\pi/2}\end{cases}$$

$$\frac{d(\arg\min_{t\in T}(|d(t,t_{2})|),t_{2})+1}{d_{qt}(t_{1})+1}$$

$$\begin{array}{ll}\bm{A_{eq}}\cdot\bm{f}+\bm{w}=\bm{0}&\\\|\bm{f^{n}}\|\geq 0&1)\\\|\bm{f^{s}}\|\leq\mu\|\bm{f^{n}}\|&2)\end{array}$$

Taxonomy

Topics: Robotics and Sensor-Based Localization · Robot Manipulation and Learning · Multimodal Machine Learning Applications

Full text

Support Relation Analysis for Objects in Multiple View RGB-D Images

Peng Zhang, Xiaoyu Ge and Jochen Renz

Research School of Computer Science

The Australian National University

{p.zhang, xiaoyu.ge, jochen.renz}@anu.edu.au

Abstract

Understanding physical relations between objects, especially their support relations, is crucial for robotic manipulation. There has been work on reasoning about support relations and structural stability of simple configurations in RGB-D images. In this paper, we propose a method for extracting more detailed physical knowledge from a set of RGB-D images taken from the same scene but from different views using qualitative reasoning and intuitive physical models. Rather than providing a simple contact relation graph and approximating stability over convex shapes, our method is able to provide a detailed supporting relation analysis based on a volumetric representation. Specifically, true supporting relations between objects (e.g., if an object supports another object by touching it on the side or if the object above contributes to the stability of the object below) are identified. We apply our method to real-world structures captured in warehouse scenarios and show our method works as desired.

1 Introduction

Scene understanding for RGB-D images has been studied extensively since affordable cameras with depth sensors, such as Kinect Zhang (2012), became available. Among the various aspects of scene understanding Chen et al. (2016), understanding spatial and physical relations between objects is essential for robotic manipulation tasks Mojtahedzadeh et al. (2013), especially when the target object belongs to a complex structure, which is common in real-world scenes. Although most research on robotic manipulation and planning focuses on handling isolated objects Ciocarlie et al. (2014); Kemp et al. (2007), increasing attention has been paid to the manipulation of physically connected objects (for example Stoyanov et al. (2016); Li et al. (2017)). Analysing more complex object structures raises several problems. For example, connected objects may remain stable due to support from adjacent objects rather than simple surface support from the bottom. In fact, a support force may come from an arbitrary direction, so a simple bottom-up support relation analysis is not sufficient. Additionally, objects may be hidden behind other objects when observed from a certain view point. Given that real-world objects often have irregular shapes, correctly segmenting the objects and extracting their contact relations are challenging tasks. To solve these problems, an efficient physical model that handles objects of arbitrary shape is required to infer precise support relations of a structure.

In this paper, we propose a framework that takes raw RGB-D images as input and produces detailed support relations between objects in a stack. Most existing work on similar topics either assumes that objects have simple convex shapes such as cuboids and cylinders Shao et al. (2014), or makes use of prior knowledge of the objects in the scene Silberman et al. (2012); Song et al. (2016) to simplify the support analysis process. Although reasonable experimental results were demonstrated, those methods usually cannot deal with scenes that contain many unknown objects. In contrast to existing methods, our proposed method does not assume any knowledge about the objects in a scene. After individually segmenting the point cloud of each view of the scene, our method builds an octree-based Meagher (1980) volumetric representation for each view with information about hidden voxels. The octree of the whole scene, combined from the input views, is then constructed using spatial reasoning about the objects. This process allows us to precisely register input point clouds from different views and provide a reliable contact graph integrating all views, which can then be used for a proper support relation analysis. We adopt an intuitive physical model to determine the overall stability of the structure. By iteratively removing the contact force between object pairs, we can infer the supporters of each object and then build the support graph. To the best of our knowledge, this is the first work that is able to explain object support relations from a physical perspective.

2 Related Work

There has been work on understanding support relations from a single-view RGB-D image in both computer vision and robotics. In computer vision, scene understanding helps to produce more accurate and detailed segmentation results. The work described in Jia et al. (2013) applied an intuitive physical model to qualitatively infer support relations between objects, and the experimental results showed improved segmentation results on simple structures. In Silberman et al. (2012), the types of objects in indoor scenes were determined by learning from a labeled data set to obtain more accurate support relations. Both of these papers took a single image as input, which limited the choice of the physical model since a significant amount of hidden information was not available. Shao et al. (2014) attempted to recover unknown voxels from single-view images by assuming the hidden objects to be cuboids and used static equilibrium to approximate the volume of the incomplete objects. In robotics, Mojtahedzadeh et al. (2013) proposed a method to safely de-stack boxes based on geometric reasoning and intuitive mechanics, which was shown to be effective in their later work Stoyanov et al. (2016). In Li et al. (2017), a simulation-based method was proposed to infer stability during robotic manipulation of cuboid objects. This method includes a learning process that uses a large set of generated simulation scenes as a training set.

Humans look at things from different angles to gather comprehensive information for a better understanding. For example, before a Jenga player takes an action, the player will usually look around the stack from several critical views to gain an overall understanding of the scene. This also applies to robots when they take images as the input source. A single input image provides incomplete information. Even when the images are taken from different views of the same static scene, the information may still be inadequate for scene understanding when using quantitative models for inferring detailed physical and spatial information, as this requires precise input. Qualitative reasoning has been demonstrated to be more suitable for modeling incomplete knowledge Kuipers (1989). There are various qualitative calculi for representing different aspects of spatial entities Randell et al. (1992); Liu et al. (2009); Ligozat (1998); Guesgen (1989); Lee et al. (2013). One qualitative calculus that seems particularly useful for reasoning about spatial structures and their stability is the Extended Rectangle Algebra (ERA) Zhang and Renz (2014), which simplifies the idea in Ge and Renz (2013) to infer the stability of 2D rectangular objects. It is possible to combine ERA with an extended version of cardinal direction relations Navarrete and Sciavicco (2006) to qualitatively represent detailed spatial relations between objects, which helps to infer the transformation between two views. It is worth mentioning that Panda et al. (2016) proposed a framework to analyze the support order of objects from multiple views of a static scene, yet this method requires relatively accurate image segmentation and the order of the images for object matching.

Models for predicting the stability of a structure have been studied for many decades. Fahlman (1974) proposed a model to analyze system stability based on Newton’s Laws. Simulation-based models have also been presented in recent years Cholewiak et al. (2013); Li et al. (2017). However, Davis and Marcus (2016) argue that probabilistic simulation-based methods are not suitable for automatic physical reasoning due to limitations that include the inability to handle imprecise input. Thus, in our approach, we aim to apply qualitative spatial reasoning to combine raw information from multiple views and extract understandable and more precise relations between objects in the environment.

3 Method Pipeline

We now describe the overall pipeline of our support relation extraction method, which consists of three modules: image segmentation, view registration and stability analysis.

The image segmentation module takes a set of RGB-D images taken from different views of a static scene as input. To retain the generality of our method, we do not assume any pre-known shapes of objects in the scene. That is, we use neither template matching methods, which can provide more accurate segmentation results, nor machine learning methods, which require large amounts of training data. This setting makes our method applicable in unknown environments. In the implementation, the raw RGB-D data is first processed by a sequence of morphological operations as described in Ku et al. (2018) in order to fill the holes in the depth map. Notably, this hole-filling algorithm does not require any pre-training, which is consistent with the no-prior-knowledge assumption in this paper. Then we use LCCP Stein et al. (2014) for point cloud segmentation. LCCP first represents the point cloud as a set of connected supervoxels Papon et al. (2013). The supervoxels are then segmented into larger regions by merging convexly connected supervoxels. The point cloud of each view is thus segmented into individual regions. We use a graph to represent the relations between the regions, with each node being a segmented region. This contact graph is then used to identify contact relations between objects in the structure. We use the Manhattan world assumption Furukawa et al. (2009) to find the ground plane. The entire scene is then rotated such that the ground plane is parallel to the horizontal (xy) plane. Details about segmentation and ground plane detection are not discussed as we use these methods with little change. Figure 1 shows a typical output from this module.

In the view registration module, we use the iterative closest point (ICP) algorithm Besl and McKay (1992) to find the transformation between two point clouds. Notably, the initial guess for the ICP algorithm is crucial. A bad initial guess may lead the registration to a local minimum, which produces incorrect results Pomerleau et al. (2015). Because the scene contains multiple objects, we propose an algorithm that finds an initial match for the point clouds based on spatial relations between the objects. This algorithm also provides a matching between the objects from two views. After the different views are registered, the contact graphs of the individual point clouds are combined to produce a contact relation graph over all input images.

In the stability analysis module, we adopt the definition of structural stability Livesley (1978) and analyze static equilibrium of the structure by representing reacting forces at each contact area as a system of equations. A structure is considered stable if the equations have a solution. Given a static input scene, several schemes will be used to adjust the unseen part of the structure to make the static equilibrium hold.

The contribution of this paper is twofold. First, we introduce a qualitative reasoning method to extract spatial relations between objects in a stack. We demonstrate its usefulness by applying this information to find a proper initial guess for the ICP algorithm. Second, we propose a method for reconstructing a volumetric model of objects with no prior knowledge about them, which is then used to analyse the true support relations of the object stack.

4 View Registration

In this section, we introduce a qualitative spatial reasoning approach to match objects from two scenes in order to find a proper initial guess for ICP to register the point clouds. In subsection 4.1, the qualitative spatial calculi and definitions related to the initial guess estimation algorithm are introduced first. In subsection 4.2, the algorithm is explained in detail.

4.1 Preliminaries on Qualitative Spatial Reasoning

The extended rectangle algebra (ERA) Zhang and Renz (2014) is a qualitative spatial calculus which can be used to reason about the structural stability of connected 2D rectangular objects. For our problem, ERA is not expressive enough as the objects are incomplete 3D entities with irregular shapes. In section 3, we mentioned that the ground plane has been detected under the Manhattan world assumption, thus it is possible to analyze spatial relations separately in the vertical and horizontal directions. Although we do not assume that all images are taken from the same height relative to the ground, it is reasonable to assume that images are taken from a human-eye view, not a bird’s-eye view. As the ground plane is detected, vertical spatial relations remain stable under view changes. In contrast, horizontal spatial relations change dramatically when the view point changes. In order to analyze the horizontal spatial relations independently, all regions are projected onto the ground plane, i.e., a 2D Euclidean space.

ERA relations can be represented using extended interval algebra (EIA) relations (see table 1) in each dimension in a 2D Euclidean space. EIA corresponds to Allen’s interval algebra Allen and Koomen (1983) with an additional center point for each interval. As a result, EIA has 27 basic relations (denoted by $B_{eint}$ ) which produce $27^{2}$ ERA relations (see Zhang and Renz (2014) for formal definitions of ERA). The $ERA$ relation for two regions $A$ and $B$ can be written as $ERA(A,B)=(EIA_{x}(A,B),EIA_{y}(A,B))$ . We will infer changes of direction relations with respect to horizontal view changes by applying ERA.
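As a rough illustration of how a per-axis relation pair is read off two bounding rectangles, the sketch below computes plain Allen interval relations on each axis, i.e. the rectangle algebra that ERA refines. It deliberately omits the center-point refinement that distinguishes the 27 EIA relations. The function names `allen` and `ra_relation` and the tuple layout `(x_min, x_max, y_min, y_max)` are assumptions made for this example, not notation from the paper.

```python
def allen(a, b):
    """Allen interval relation of interval a with respect to interval b.

    Intervals are (start, end) pairs with start < end.
    """
    a1, a2 = a
    b1, b2 = b
    if a2 < b1:
        return "b"    # before
    if a2 == b1:
        return "m"    # meets
    if a1 > b2:
        return "bi"   # after
    if a1 == b2:
        return "mi"   # met by
    if a1 == b1 and a2 == b2:
        return "eq"   # equal
    if a1 == b1:
        return "s" if a2 < b2 else "si"   # starts / started by
    if a2 == b2:
        return "f" if a1 > b1 else "fi"   # finishes / finished by
    if a1 > b1 and a2 < b2:
        return "d"    # during
    if a1 < b1 and a2 > b2:
        return "di"   # contains
    # Only proper overlaps remain at this point.
    return "o" if a1 < b1 else "oi"


def ra_relation(mbr_a, mbr_b):
    """Pair of Allen relations (x-axis, y-axis) between two rectangles.

    Each rectangle is (x_min, x_max, y_min, y_max); this layout is an
    assumption of the sketch.
    """
    ax0, ax1, ay0, ay1 = mbr_a
    bx0, bx1, by0, by1 = mbr_b
    return allen((ax0, ax1), (bx0, bx1)), allen((ay0, ay1), (by0, by1))
```

For instance, a rectangle touching the bottom-left corner of another yields the pair `("m", "m")`. An EIA version would additionally compare each interval's end points with the other interval's center point.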

Definition 1 (region, region centroid, region radius)

Given a raw point cloud $PC$, a region $a_{l}$ is the set of points $\{p_{1},...,p_{n}\}\subseteq PC$ that receive the same label $l$ from a segmentation algorithm. Let $c=(x_{c},y_{c},z_{c})$ denote the region centroid, where

$$x_{c}=\frac{\sum_{i=1}^{n}x_{p_{i}}}{n},\quad y_{c}=\frac{\sum_{i=1}^{n}y_{p_{i}}}{n},\quad z_{c}=\frac{\sum_{i=1}^{n}z_{p_{i}}}{n}$$

$dist(c,p_{i})$ denotes the Euclidean distance between the region centroid $c$ and an arbitrary point $p_{i}\in a_{l}$. The region radius $r$ of a region $a$ is:

$$r=\max_{i\in\{1,...,n\}}dist(c,p_{i})$$
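The centroid and radius of Definition 1 can be sketched in a few lines. This is a minimal illustration, assuming a region is given as a list of `(x, y, z)` tuples; the function names are hypothetical.

```python
import math


def region_centroid(points):
    """Mean of the x, y and z coordinates of the points in a region."""
    n = len(points)
    if n == 0:
        raise ValueError("a region must contain at least one point")
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def region_radius(points):
    """Largest Euclidean distance from the centroid to any region point."""
    c = region_centroid(points)
    return max(math.dist(c, p) for p in points)
```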

Let $mbr$ denote the minimal bounding rectangle of a region. The $mbr$ changes as the view changes, and the $EIA$ relation between two regions changes accordingly. By analyzing the $EIA$ change, an approximate horizontal rotation level between two views can be determined. Before considering regions that are incomplete due to occlusion or sensor noise, we first examine how the values of $r_{x}^{-},r_{x}^{+},r_{y}^{-},r_{y}^{+}$ change assuming the regions are completely sensed. We identify a conceptual neighborhood graph of $EIA$ which includes all possible one-step changes with respect to horizontal rotation of views (see figure 2).

Definition 2 (view point, change of view)

The view point $v$ is the position of the camera. Let $v_{1}$ and $v_{2}$ denote two view points, and let $c$ be the region centroid of the sensed connected regions excluding the ground plane. Assume the point cloud has been rotated such that the ground plane is parallel to the plane defined by the x-axis and y-axis of a 3D coordinate system. Let $v_{1xy}$, $v_{2xy}$ and $c_{xy}$ be the vertical projections of $v_{1}$, $v_{2}$ and $c$ onto the xy plane. The change of view $C$ is the angle difference between the line segments $c_{xy}v_{1xy}$ and $c_{xy}v_{2xy}$.
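The change of view in Definition 2 can be computed from the projected points with `atan2`. The sketch below is only illustrative: the function name is hypothetical, and the sign convention (positive for an anticlockwise change when looking down the z-axis, result wrapped into the range from $-\pi$ to $\pi$) is an assumption, since the definition itself only speaks of an angle difference.

```python
import math


def change_of_view(v1, v2, c):
    """Signed angle from segment c_xy->v1_xy to segment c_xy->v2_xy.

    v1, v2 and c are (x, y, z) tuples; the z coordinate is dropped,
    which corresponds to the vertical projection onto the xy plane.
    """
    a1 = math.atan2(v1[1] - c[1], v1[0] - c[0])
    a2 = math.atan2(v2[1] - c[1], v2[0] - c[0])
    diff = a2 - a1
    # Wrap the difference so that a small turn is not reported as almost 2*pi.
    return math.atan2(math.sin(diff), math.cos(diff))
```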

Definition 3 (symmetric EIA relation)

Let $R\in B_{eint}$ be an arbitrary EIA atomic relation. The symmetric EIA relation of $R$ (denoted by $symm(R)$) is defined as $R$’s axially symmetric atomic relation with respect to the axis of symmetry formed by the relations $\{cd,eq,cdi\}$ in the conceptual neighborhood graph given in figure 2. For example, $symm(mol)=lomi$. The relations $cd$, $eq$ and $cdi$ are their own symmetric relations.

Lemma 1

Let $C_{cw\pi/2}$ denote the view change of $\pi/2$ clockwise from view point $v_{1}$ to $v_{2}$. Let $ERA_{ab_{1}}=(r_{x1},r_{y1})$ and $ERA_{ab_{2}}=(r_{x2},r_{y2})$ denote the $ERA$ relations between regions $a$ and $b$ at $v_{1}$ and $v_{2}$.

Assuming the connected regions are fully sensed, $r_{x2}=symm(r_{y1})$ and $r_{y2}=r_{x1}$. Similarly, if the view changes by $\pi/2$ anticlockwise, then $r_{x2}=r_{y1}$ and $r_{y2}=symm(r_{x1})$.

Proof: Lemma 1 can be proved by reconstructing a coordinate system at each view point.
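Lemma 1 translates directly into a small lookup, sketched below. The `PARTIAL_SYMM` table is intentionally incomplete: it contains only the pairs stated in Definition 3 ($symm(mol)=lomi$ and the self-symmetric $cd$, $eq$, $cdi$) plus the reverse pair, which follows because an axial reflection is its own inverse. A full implementation would need all 27 entries from the conceptual neighborhood graph; the names `symm` and `rotate_era` are hypothetical.

```python
# Partial table only: entries not listed in the text are deliberately absent.
PARTIAL_SYMM = {
    "mol": "lomi",
    "lomi": "mol",
    "cd": "cd",
    "eq": "eq",
    "cdi": "cdi",
}


def symm(relation):
    """Symmetric EIA relation, for the relations covered by PARTIAL_SYMM."""
    try:
        return PARTIAL_SYMM[relation]
    except KeyError:
        raise KeyError(f"symm({relation!r}) is not in the partial table") from None


def rotate_era(era, clockwise=True):
    """ERA relation after a pi/2 view change, following Lemma 1.

    era is the pair (r_x1, r_y1) observed at the first view point.
    """
    r_x1, r_y1 = era
    if clockwise:
        return symm(r_y1), r_x1
    return r_y1, symm(r_x1)
```

A clockwise change followed by an anticlockwise change returns the original pair, which is a quick sanity check on the two rules.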

Although the conceptual neighborhood graph indicates the possible paths along which the relation of a pair of objects can change in one dimension, the way the changes happen depends on the rotation direction (clockwise or anticlockwise) and on their $ERA$ relation before the rotation. For example, $ERA(A,B)=(m,m)$ means that $mbr(A)$ touches $mbr(B)$ at the bottom-left corner of $mbr(B)$. Thus, if the view rotates anticlockwise, $mbr(A)$ tends to move upwards relative to $mbr(B)$ regardless of the real shapes of $A$ and $B$, so $EIA_{y}(A,B)$ will change from ‘$m$’ to ‘$lol$’ but not to ‘$b$’. To determine the changing trend of $ERA$ relations more efficiently, we combine $ERA$ with cardinal direction relations (CDR), which describe where one region lies relative to another in terms of directional position.

The basic $CDR$ Skiadopoulos and Koubarakis (2004) contains nine cardinal tiles as shown in figure 3. ‘$B$’ represents the relation ‘$belong$’, and the other eight relations represent the cardinal directions N (north), NE (north-east), etc.

Definition 4 (basic CDR relation)

A basic CDR relation is an expression $R_{1}:...:R_{k}$ with $1\leq k\leq 9$ where:

1. $R_{1},...,R_{k}\in\{B,N,NE,E,SE,S,SW,W,NW\}$

2. $R_{i}\neq R_{j},~\forall 1\leq i,j\leq k,i\neq j$

3. $\forall b\in REG,~\exists a_{1},...,a_{k}\in REG~and~a_{1}\cup...\cup a_{k}\in REG$ (regions that are homeomorphic to the closed unit disk $\{(x,y):x^{2}+y^{2}\leq 1\}$ are denoted by REG).

If $k=1$, the relation is called a single-tile relation and otherwise a multi-tile relation.

Similar to the extension from $RA$ to $ERA$ , we introduce center points to extend basic $CDR$ to the extended $CDR$ (denoted as $ECDR$ , see figure 3) in order to express inner relations and detailed outer relations between regions. Notably, a similar extension about inner relations of CDR was proposed in Liu et al. (2005). However, as we focus on $mbr$ of regions in this problem, their notions for inner relations with trapezoids are not suitable for our representation.

In Navarrete and Sciavicco (2006), rectangular cardinal direction relations (RCDR), which combine $RA$ and $CDR$, were studied. As a subset of CDR, RCDR considers single-tile relations and a subset of multi-tile relations that represent relations between two rectangles whose edges are parallel to the two axes. We combine $ERA$ and $ECDR$ in a similar way to produce the extended rectangular cardinal direction relations (ERCDR). Including all single-tile and multi-tile relations, there exist 100 valid relations (not all listed in this paper) to represent the directional relation between two $mbrs$.

4.2 Initial Guess Estimation for ICP

In this section, we propose the algorithm for matching objects between two views. In section 3, the point cloud was aligned with the ground plane, so the only two factors in the initial transformation estimate are the rotation about the vertical axis and the translation. With the matched objects, the task of estimating the initial transformation between two point clouds for the ICP algorithm then reduces to minimising the sum of distances between all matched object pairs.

Informally, assume that a spatial relation graph is built over every pair of objects in the same view. If most objects in one view are correctly matched to the corresponding ones in the other view, the two spatial relation graphs become very similar (though not necessarily identical, because the input is incomplete) after rotating one view by a certain angle. For example, in figure 4, the spatial relation graph for the left view is {west($a_{1}$, $b_{1}$), northwest($a_{1}$, $c_{1}$), northwest($a_{1}$, $c_{1}$)}, and for the right view it is {east($a_{2}$, $b_{2}$), southeast($a_{2}$, $c_{2}$), southeast($a_{2}$, $c_{2}$)}. If all three pairs of objects are matched correctly, an identical graph can be obtained by rotating the right view by $\pi/2$ from any direction. If any objects are matched wrongly, an identical graph can never be obtained.

Therefore, a distance function is necessary for measuring the similarity of two relation graphs. The calculation of the distance between two ERCDR relations is described below.

Definition 5 (directional property of a single-tile relation)

Each tile $t$ in ECDR has both a horizontal directional property $HDP(t)\in\{E,W\}$ and a vertical directional property $VDP(t)\in\{N,S\}$ . The value is determined by the relative directional relation between the centroid of $t$ and the centroid of the reference region $b$ . For example, $HDP(INE)=E$ and $VDP(INE)=N$ .

Definition 6 (directional property of multi-tile relation)

The directional property of a multi-tile relation $mt$ is determined by the majority of the single-tile directional properties in the multi-tile relation, where $HDP(mt)\in\{E,M,W\}$ and $VDP(mt)\in\{N,M,S\}$. $M$ represents ‘middle’, which occurs when the counts of the single-tile directional properties are equal. For example, $HDP(WMN:EMN:INW:INE)=M$ and $VDP(WMN:EMN:INW:INE)=N$.
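The majority rule of Definition 6 can be sketched as a vote over the single tiles. The per-tile table below is hypothetical and covers only the four tiles used in the example; their `(HDP, VDP)` values are inferred from the tile names and from the stated result of the example, not taken from a full listing of ECDR tiles.

```python
from collections import Counter

# Hypothetical partial table: (HDP, VDP) for the tiles named in the example.
TILE_PROPERTIES = {
    "INE": ("E", "N"),
    "INW": ("W", "N"),
    "EMN": ("E", "N"),
    "WMN": ("W", "N"),
}


def _majority(values, first, second):
    """Return the more frequent of two labels, or 'M' when the counts tie."""
    counts = Counter(values)
    if counts[first] > counts[second]:
        return first
    if counts[second] > counts[first]:
        return second
    return "M"


def directional_property(relation):
    """(HDP, VDP) of a single- or multi-tile relation such as 'WMN:EMN'."""
    tiles = relation.split(":")
    hdp = _majority([TILE_PROPERTIES[t][0] for t in tiles], "E", "W")
    vdp = _majority([TILE_PROPERTIES[t][1] for t in tiles], "N", "S")
    return hdp, vdp
```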

The directional property can be used to estimate the trend of the relation change. When the view point rotates clockwise, the observed HDP and VDP determine the change trend as described in the following table:

[TABLE]

One ERCDR relation can be transformed into another by moving the four bounding lines of the single- or multi-tile. The distance $d$ between two ERCDR relations is calculated in the horizontal and vertical directions by counting how many grid cells each bounding line moves across. The distance depends on the direction in which the view point changes as well as on the directional properties of the region. However, for the same angle difference, the inner tiles undergo far fewer changes than the outer tiles. We therefore introduce the quarter distance, which represents an angle change of $\pi/2$, to normalize the distance between ERCDR relation pairs with respect to the angle difference.

Based on lemma 1, we can infer the ERA relation between two regions after rotating the view point by $\pi/2$ , $\pi$ and $3\pi/2$ either clockwise or anticlockwise. The ERA relation can be then represented by an ERCDR tile. In order to calculate the distance between ERCDR tiles, we label each corner of the single tiles with a 2D coordinate with bottom-left corner of tile $SW$ to be the origin $(0,0)$ (see figure 3).

Definition 7 (distance between ERCDR tiles)

Let $t_{1}$ and $t_{2}$ be two ERCDR tiles. $x_{1}^{-}$ and $x_{1}^{+}$ are the left and right bounding lines of $t_{1}$, and $y_{1}^{-}$ and $y_{1}^{+}$ are the top and bottom bounding lines of $t_{1}$. Likewise, $x_{2}^{-}$ and $x_{2}^{+}$ are the left and right bounding lines of $t_{2}$, and $y_{2}^{-}$ and $y_{2}^{+}$ are the top and bottom bounding lines of $t_{2}$.

The unsigned distance between $t_{1}$ and $t_{2}$ is calculated as:

$$|d(t_{1},t_{2})|=|(x_{2}^{+}-x_{1}^{+})+(x_{2}^{-}-x_{1}^{-})|+|(y_{2}^{+}-y_{1}^{+})+(y_{2}^{-}-y_{1}^{-})|$$

The sign of $d$ is determined by whether the change trend suggested by the directional property is followed. If so, $d(t_{1},t_{2})=|d(t_{1},t_{2})|$; otherwise $d(t_{1},t_{2})=-|d(t_{1},t_{2})|$. If there is no change or a symmetric change in the trend direction, $d(t_{1},t_{2})=0$.
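The distance of Definition 7 can be sketched as follows. A tile is represented here as the tuple `(x_minus, x_plus, y_minus, y_plus)` of its bounding-line grid coordinates, and the outcome of the trend check is passed in as `follows_trend` (`True`, `False`, or `None` for no change or a symmetric change). Both the tuple layout and that parameter are assumptions of this example, because the trend table itself is not reproduced in the text.

```python
def unsigned_distance(t1, t2):
    """|d(t1, t2)|: how far the bounding lines move, summed over both axes."""
    x1m, x1p, y1m, y1p = t1
    x2m, x2p, y2m, y2p = t2
    horizontal = abs((x2p - x1p) + (x2m - x1m))
    vertical = abs((y2p - y1p) + (y2m - y1m))
    return horizontal + vertical


def signed_distance(t1, t2, follows_trend):
    """d(t1, t2) with the sign rule stated after the unsigned distance."""
    if follows_trend is None:
        # No change, or a symmetric change, in the trend direction.
        return 0
    magnitude = unsigned_distance(t1, t2)
    return magnitude if follows_trend else -magnitude
```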

If there is a significant angle difference between the two tiles, the distance may not be accurate because there are multiple paths for the transformation. We therefore introduce three more reference tiles by rotating the original tile by $\pi/2$, $\pi$ and $3\pi/2$ in turn.

Definition 8 (quarter distance)

Let $t_{1}$ be an ERCDR tile and $t_{1}^{\prime}$ be the ERCDR tile produced by rotating $t_{1}$ by $\pi/2$ or $-\pi/2$. By applying lemma 1, the reference tiles can be easily mapped to ERCDR tiles. The quarter distance of $t_{1}$ is defined as:

$$d_{qt}=d(t_{1},t_{1}^{\prime})$$

Definition 9 (normalized distance)

Let $t_{1}$ be an ERCDR tile. $t_{\pi/2}$ , $t_{\pi}$ and $t_{3\pi/2}$ denote the three reference tiles for $t_{1}$ . Let $t_{2}$ be another ERCDR tile. The normalized distance $d_{norm}(t_{1},t_{2})$ will be calculated in two parts:

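1. The base distance $d_{base}$. Let $T=\{t_{1},t_{\pi/2},t_{\pi},t_{3\pi/2}\}$.

$$d_{base}(t_{1},t_{2})=\begin{cases}0&\text{if }\arg\min_{t\in T}(|d(t,t_{2})|)=t_{1}\\1&\text{if }\arg\min_{t\in T}(|d(t,t_{2})|)=t_{\pi/2}\\2&\text{if }\arg\min_{t\in T}(|d(t,t_{2})|)=t_{\pi}\\3&\text{if }\arg\min_{t\in T}(|d(t,t_{2})|)=t_{3\pi/2}\end{cases}$$

2. The normalized distance

$$d_{norm}(t_{1},t_{2})=d_{base}(t_{1},t_{2})+\frac{d(\arg\min_{t\in T}(|d(t,t_{2})|),t_{2})+1}{d_{qt}(t_{1})+1}$$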
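The two parts of Definition 9 can be combined in one short function. In this sketch the signed tile distance `d` is passed in as a callable and the quarter distance `d_qt` as a number, so the code stays independent of how tiles are encoded. The function name and argument layout are assumptions, and ties in the minimum are broken in favour of the earlier reference tile, which the definition does not specify.

```python
def normalized_distance(reference_tiles, t2, d, d_qt):
    """d_norm(t1, t2) for reference_tiles = [t1, t_pi/2, t_pi, t_3pi/2].

    d(a, b) is a signed distance between two tiles and d_qt is the quarter
    distance of t1. d_qt must not equal -1, or the fraction is undefined.
    """
    if len(reference_tiles) != 4:
        raise ValueError("expected t1 and its three rotated reference tiles")
    # The index of the closest reference tile is exactly d_base (0, 1, 2 or 3).
    d_base = min(range(4), key=lambda i: abs(d(reference_tiles[i], t2)))
    nearest = reference_tiles[d_base]
    return d_base + (d(nearest, t2) + 1) / (d_qt + 1)
```

As a toy usage example, with tiles reduced to integers and `d = lambda a, b: b - a`, the call `normalized_distance([0, 3, 6, 9], 4, d, 3)` selects the second reference tile and evaluates to `1 + (1 + 1) / (3 + 1)`.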

Having the normalized distance for calculating the similarity between two ERCDR tiles, we now show the algorithm for identifying proper matching between objects from two different views in algorithm 1.

Once the matching of objects has been determined, the initial transformation can be calculated by performing a local search over the horizontal rotation to obtain a minimal sum of distances between the two relation graphs. Then a translation is calculated by minimising the Euclidean distance between the geometric centers of the matched objects. The ICP algorithm is then run using the calculated initial transformation.

5 Stability Analysis

In this section, we introduce a method to complete the object with invisible voxels. Then, we show how to get support relation from the volumetric representation of the objects.

5.1 Object Completion

With the registration of multiple views, an octree representation of the scene is built. The next step is to assign invisible voxels to the objects in order to analyse the support relations. First, an oriented minimal bounding box (OMBB) is calculated for each object. We run the RANSAC algorithm Fischler and Bolles (1981) to fit the largest plane to the object point cloud. This plane is used as one surface, and the OMBB of the object is then determined. All invisible voxels in the OMBB are then assigned to this object. As the octree is built with point clouds from different views, the set of invisible voxels is largely eliminated and the volumetric model tends to represent the intrinsic shape of the object. Figure 5 shows the process of object completion in 2D for simplicity.
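The voxel assignment step can be illustrated with a simplified stand-in. The sketch below uses an axis-aligned bounding box over integer voxel indices instead of the RANSAC-based oriented box described above, so it shows only the idea of claiming unobserved voxels inside an object's box. The function name and the set-based voxel representation are assumptions.

```python
def complete_object(visible, invisible):
    """Invisible voxels that fall inside the bounding box of an object.

    visible:   set of (i, j, k) voxel indices observed for the object
    invisible: set of (i, j, k) voxel indices that no view has observed
    The box is axis-aligned, which simplifies the oriented box in the text.
    """
    if not visible:
        return set()
    lo = tuple(min(v[k] for v in visible) for k in range(3))
    hi = tuple(max(v[k] for v in visible) for k in range(3))
    return {
        v for v in invisible
        if all(lo[k] <= v[k] <= hi[k] for k in range(3))
    }
```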

5.2 Support Relation Analysis

We use a modified version of the structural analysis method in Ge et al. (2017). A structure is in static equilibrium when the net force and the net torque on the structure are both zero. The static equilibrium is expressed as a system of linear equations Whiting et al. (2009):

$$\begin{array}{ll}\bm{A_{eq}}\cdot\bm{f}+\bm{w}=\bm{0}&\\\|\bm{f^{n}}\|\geq 0&1)\\\|\bm{f^{s}}\|\leq\mu\|\bm{f^{n}}\|&2)\end{array}$$

$A_{eq}$ is the coefficient matrix where each column stores the unit direction vectors of the forces and the torque at a contact point. To identify the contact points between two contacting objects, we first fit a plane to all points of the connected regions between the objects. We then project all points onto the plane and obtain the oriented minimum bounding rectangle of the points. The resulting bounding rectangle approximates the region of contact, and the four corners of the rectangle are used as contact points. $f$ is a vector of unknowns representing the magnitude of each force at the corresponding contact vertex. The forces include normal contact forces $\bm{f^{n}}$ and friction forces $\bm{f^{s}}$ at the contact vertex. Constraint 1) requires the normal forces to be non-negative and constraint 2) requires the friction forces to comply with the Coulomb model, where $\mu$ is the coefficient of static friction. A structure is stable when the equations have a solution that satisfies both constraints.
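To make the feasibility idea concrete, the sketch below solves a heavily reduced case in closed form: a rigid body in a vertical plane resting on two frictionless point contacts, with only vertical forces. Force balance gives $f_{left}+f_{right}=w$ and torque balance about the left contact gives $f_{right}(x_{right}-x_{left})=w(x_{com}-x_{left})$. The body is treated as stable when both normal forces are non-negative, mirroring constraint 1). This is an illustrative special case under those assumptions, not the general system above, and all names are hypothetical.

```python
def two_contact_forces(weight, x_com, x_left, x_right):
    """Normal forces at two point contacts supporting a body of given weight.

    x_com is the horizontal position of the centre of mass; x_left and
    x_right are the horizontal positions of the two contacts.
    """
    if x_right <= x_left:
        raise ValueError("x_right must be greater than x_left")
    span = x_right - x_left
    # Torque balance about the left contact fixes the right-hand force.
    f_right = weight * (x_com - x_left) / span
    # Force balance then fixes the left-hand force.
    f_left = weight - f_right
    return f_left, f_right


def is_stable(weight, x_com, x_left, x_right):
    """True when equilibrium holds with non-negative normal forces."""
    f_left, f_right = two_contact_forces(weight, x_com, x_left, x_right)
    return f_left >= 0 and f_right >= 0
```

When the centre of mass lies outside the two contacts, one force would have to be negative, so no admissible solution exists and the body is reported as unstable.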

Using the structural analysis method, we can identify support relations between objects. Specifically, we are interested in identifying the core supporters (Ge et al., 2016) of each object in a scene. An object $o_{1}$ is a core supporter of another object $o_{2}$ if $o_{2}$ becomes unstable after $o_{1}$ is removed. Given a contact between $o_{1}$ and $o_{2}$, to test whether $o_{1}$ is a core supporter of $o_{2}$, we first identify the direction vectors of the forces and torque that the contact exerts on $o_{2}$, and set them to zero in Eq. 7. This is equivalent to removing all forces that $o_{1}$ imposes on $o_{2}$. If the resulting Eq. 7 has no solution, then $o_{1}$ is a core supporter of $o_{2}$. We test each pair of objects in a scene and obtain a support graph, defined as a directed graph in which each vertex represents an object. There is an edge from $v_{1}$ to $v_{2}$ if $o_{1}$ is a core supporter of $o_{2}$.
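The remove-and-retest logic can be illustrated with a deliberately simplified stand-in for Eq. 7. The sketch below is not the full system: it assumes a single rigid object in 2D, frictionless contacts that can only push vertically upward, and contact points described by their horizontal positions. Under those assumptions, the equilibrium conditions (non-negative forces that sum to the weight and balance the torque) have a solution exactly when the centre of mass lies between the outermost contact points, so feasibility reduces to an interval check. The general 3D case with friction would need a linear-programming solver. All names are hypothetical.

```python
def is_stable(contact_xs, com_x, tol=1e-9):
    """Feasibility of sum(f) = W, sum(f * x) = W * com_x with all f >= 0 (W > 0)."""
    if not contact_xs:
        return False
    return min(contact_xs) - tol <= com_x <= max(contact_xs) + tol

def core_supporters(contacts, com_x):
    """contacts maps each supporter name to the x positions of its contact points.

    A supporter is 'core' if dropping its contacts (the analogue of zeroing
    its columns in Eq. 7) makes the equilibrium system infeasible.
    """
    all_xs = [x for xs in contacts.values() for x in xs]
    if not is_stable(all_xs, com_x):
        return set()  # already unstable, so removal tests are not meaningful
    core = set()
    for name in contacts:
        remaining = [x for other, xs in contacts.items()
                     if other != name for x in xs]
        if not is_stable(remaining, com_x):
            core.add(name)
    return core

def support_graph(scene):
    """scene maps object -> (com_x, contacts); returns directed edges (supporter, object)."""
    edges = set()
    for obj, (com_x, contacts) in scene.items():
        for supporter in core_supporters(contacts, com_x):
            edges.add((supporter, obj))
    return edges
```

For example, a plank with its centre of mass at `1.0` resting on supports `A` (contacts at `0.0` and `0.4`) and `C` (contacts at `1.6` and `2.0`) has both as core supporters, whereas adding a third support `B` under the centre makes none of them individually essential.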

6 Experiments

We first evaluate our method for estimating the initial guess for the ICP algorithm. We then demonstrate its ability to identify the core supporters of an object in a structure. For both experiments, we test our method on two data sets as well as on some single scenes with special configurations. Data set 1 is from Panda et al. (2016) and contains seven different scenes. Data set 2 was captured in a warehouse scenario from a real logistics application for sorting parcels, and consists of five scenes.

6.1 Initial Guess Estimation of ICP

In this experiment, we compare the initial guess produced by Algorithm 1 with a random initial guess for ICP point cloud registration. Figure 6 shows the result qualitatively. In Table 2, we use the mean squared error (MSE) over all point pairs to evaluate the quality of the registration on data set 2, which consists of more complex scenes.
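A registration error of this kind could be computed as in the sketch below. How the point pairs are formed is not stated in the text. The example assumes nearest-neighbour correspondences (as ICP itself uses) found by brute force, and the function name is hypothetical.

```python
def registration_mse(source, target):
    """Mean squared distance from each source point to its nearest target point.

    Nearest-neighbour pairing is an assumption; brute force is O(n * m).
    """
    total = 0.0
    for p in source:
        total += min(sum((a - b) ** 2 for a, b in zip(p, q)) for q in target)
    return total / len(source)
```

For larger point clouds, a spatial index such as a k-d tree would replace the inner loop, but the quantity being averaged stays the same.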

6.2 Support Graph Evaluation

For core supporter detection, we show that our method outperforms the method of Panda et al. (2016) on data set 1. Because the method of Panda et al. (2016) requires precise object models for segmentation, it does not work in the unknown scenarios of data set 2. Therefore, only Algorithm 1 is tested on data set 2. The support relation accuracy there is slightly lower because segmentation errors are more frequent and introduce more false-positive support relations.
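One way to quantify the effect of false-positive support relations is to compare the predicted support graph with a hand-labelled one edge by edge. The exact accuracy measure used in the experiments is not defined in the text, so the precision and recall below are an assumed, illustrative choice, and the function name is hypothetical.

```python
def edge_scores(predicted, ground_truth):
    """Compare directed support edges (supporter, supported) against labels."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    tp = len(predicted & ground_truth)   # correctly detected support relations
    fp = len(predicted - ground_truth)   # spurious relations, e.g. from bad segmentation
    fn = len(ground_truth - predicted)   # missed relations
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}
```

With this measure, an over-segmented object that produces an extra edge lowers precision while leaving recall unchanged.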

In addition to data sets 1 and 2, we use a single scene with special support relations (e.g. the top object supports the bottom object) as well as the data from Panda et al. (2016). Table 3 compares our core supporter detection results with the methods of Panda et al. (2016). Our algorithm finds most of the true support relations. In addition, it detects some special core supporters, such as A in Figure 7, that are difficult to detect with statistical methods.

Figure 7 presents the results of core supporter detection. Notably, in the second row, we detect that object C contributes to the stability of object B even though C is on top of B. Thus, C is a core supporter of B as well as A.

7 Conclusion and Future Work

In this paper, we propose a framework for identifying support relations among a group of connected objects, taking as input a set of RGB-D images of the same static scene from different views. We assume no prior knowledge about the objects or the environment. By qualitatively reasoning about the angle change between each pair of input images, we match the objects across views and calculate the initial guess for the ICP algorithm. We use static equilibrium to analyse the stability of the whole structure and extract the core support relations between objects in the structure. Our method detects most of the support relations. With the capability of analysing core support relations, the perception system can assist an AI agent in reasoning causally about the consequences of an action applied to an object in a structure. Clearly, this is only one aspect of the physical relations that can be derived. In the future, more object features (e.g. solidity, density distribution) and relations between objects (e.g. containment, relative position) can be studied.

Bibliography


1. Allen and Koomen [1983] J. F. Allen and J. A. Koomen. Planning using a temporal world model. In IJCAI 1983, pages 741–747. Morgan Kaufmann Publishers Inc., 1983.
2. Besl and McKay [1992] Paul J. Besl and Neil D. McKay. Method for registration of 3-D shapes. In Sensor Fusion IV: Control Paradigms and Data Structures, volume 1611, pages 586–607. International Society for Optics and Photonics, 1992.
3. Chen et al. [2016] K. Chen, Y. Lai, and S. Hu. 3D indoor scene modeling from RGB-D data: a survey. CVM, 1(4):267–278, 2016.
4. Cholewiak et al. [2013] S. A. Cholewiak, R. W. Fleming, and M. Singh. Visual perception of the physical stability of asymmetric three-dimensional objects. JOV, 13(4):12–12, 2013.
5. Ciocarlie et al. [2014] M. Ciocarlie, K. Hsiao, E. G. Jones, S. Chitta, R. B. Rusu, and I. A. Şucan. Towards reliable grasping and manipulation in household environments. In ISER 2014, pages 241–252. Springer, 2014.
6. Davis and Marcus [2016] E. Davis and G. Marcus. The scope and limits of simulation in automated reasoning. AIJ, 233:60–72, 2016.
7. Fahlman [1974] S. E. Fahlman. A planning system for robot construction tasks. AIJ, 5(1):1–49, 1974.
8. Fischler and Bolles [1981] Martin A. Fischler and Robert C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
