Factored Pose Estimation of Articulated Objects using Efficient Nonparametric Belief Propagation
Karthik Desingh, Shiyang Lu, Anthony Opipari, Odest Chadwicke Jenkins

TL;DR
This paper introduces an efficient nonparametric belief propagation method for estimating the continuous, multi-modal poses of articulated objects in cluttered environments, enabling robots to perceive and manipulate complex jointed objects.
Contribution
It presents a novel factored approach using a pairwise Markov Random Field and the PMPNBP algorithm for accurate articulated pose estimation from RGBD data.
Findings
The method effectively estimates object-part poses in high-dimensional spaces.
It demonstrates convergence over complex articulated scenes.
The approach outperforms existing techniques in pose accuracy.
All authors are with the Department of Electrical Engineering and Computer Science, Robotics Institute, University of Michigan, Ann Arbor. Email: {kdesingh,shiyoung,topipari,ocj}@umich.edu
Abstract
Robots working in human environments often encounter a wide range of articulated objects, such as tools, cabinets, and other jointed objects. Such articulated objects can take an infinite number of possible poses, as a point in a potentially high-dimensional continuous space. A robot must perceive this continuous pose to manipulate the object to a desired pose. This problem of perception and manipulation of articulated objects remains a challenge due to its high dimensionality and multi-modal uncertainty. In this paper, we propose a factored approach to estimate the poses of articulated objects using an efficient nonparametric belief propagation algorithm. We consider inputs as geometrical models with articulation constraints, and observed RGBD sensor data. The proposed framework produces object-part pose beliefs iteratively. The problem is formulated as a pairwise Markov Random Field (MRF) where each hidden node (continuous pose variable) is an observed object-part’s pose and the edges denote the articulation constraints between the parts. We propose articulated pose estimation by Pull Message Passing algorithm for Nonparametric Belief Propagation (PMPNBP) and evaluate its convergence properties over scenes with articulated objects.
I Introduction
Robots working in human environments often encounter a wide range of articulated objects, such as tools, cabinets, and other kinematically jointed objects. For example, the cabinet with three drawers shown in Fig. 1 functions as a storage container. A robot would need to perform open and close actions on the drawers to accomplish storage and retrieval tasks. Accomplishing such tasks involves repeated sense-plan-act phases under uncertainty in the robot’s observations, and demands pose estimation that accommodates this uncertainty to inform a planner of the current state of the world. The uncertainty arises from sensor noise and inherent environmental occlusions. Perceiving articulated pose from partial observations, caused by self-occlusion and environmental occlusion, makes the inference problem multi-modal. Further, inference becomes a high-dimensional problem as the number of object parts grows.
Pose estimation methods have been proposed that take a generative approach to the problem [22, 2, 26]. These methods aim to explain a scene as a collection of object and part poses, using a particle filter formulation to iteratively maintain belief over possible states in the form of particles. Though these approaches have the power of modeling the world generatively, they have an inherent drawback: they become slow as the number of rigid bodies increases. In this paper, we focus on overcoming this drawback by factoring the state into individual object parts constrained by their articulations, creating an efficient inference framework for pose estimation.
Generative methods exploiting articulation constraints are widely used in human pose estimation problems [17, 21, 24], where human body parts have constrained articulation. We take a similar approach and factor the problem using a Markov Random Field (MRF) formulation in which each hidden node in the probabilistic graphical model represents an observed object-part’s pose (a continuous variable), each observed node carries the information about the object-part from the observation, and the edges of the graph denote the articulation constraints between the parts. Inference on the graph is performed using a message passing algorithm that shares information between the parts’ pose variables to produce their pose beliefs, which collectively give the state of the articulated object.
Existing message passing approaches [10, 20] represent a message as a mixture of Gaussian components and provide Gibbs sampling based techniques to approximate message product and update operations. Their message representation and product techniques limit the number of samples used in the inference and are not applicable to our application domain. In this paper we provide a more efficient “pull” Message Passing algorithm for Nonparametric Belief Propagation (PMPNBP). The key idea of pull message updating is to evaluate samples taken from the belief of the receiving node with respect to the densities informing the sending node. The mixture product approximation can then be performed individually per sample, and later normalized into a distribution. This pull updating avoids the computational pitfalls of the push updating of message distributions used in [10, 20].
Our system takes a 3D point cloud as sensor data and object geometry models in the form of a URDF (Unified Robot Description Format) as input, and outputs belief samples in the continuous pose domain. We use these belief samples to compute a maximum likelihood estimate that lets the robot act on the object. We quantify the performance of the system on an articulated object in several challenging scenes. Contributions of this paper include: a) an efficient belief propagation algorithm to estimate articulated object poses, b) discussion of and comparison with a traditional particle filter as a baseline, and c) a belief representation from perception to inform a task planner. A simple task is illustrated to show how belief propagation informs a task planner to choose an information-gathering action and overcome uncertainty in the perceptual estimation.
II Related work
Existing methods in the literature have set out to address the challenge of manipulating articulated objects by robots in complex human environments. Particular focus has been placed on addressing the task of estimating novel articulated objects’ kinematic models by a robot through interactive perception. Hausman et al. [7] propose a particle filtering approach to estimate articulation models and plan actions to reduce the model uncertainty. In [12], Martin et al. suggest an online interactive perception technique for estimating kinematic models by incorporating low-level point tracking and mid-level rigid body tracking with high-level kinematic model estimation over time. Sturm et al. [19, 18] addressed the task of estimating articulation models in a probabilistic fashion by human demonstration of manipulation examples.
All of these approaches discover the articulated object’s kinematic model by alternating between action and sensing, and are important methods for enabling a robot to reliably interact with novel articulated objects. In this paper we assume that such kinematic models, once learned for an object, can be reused to localize its articulated pose under ambiguous real-world observations. The method proposed in this paper could complement the existing body of work towards task completion in unstructured human environments.
Probabilistic graphical model representations such as Markov random fields (MRFs) are widely used in computer vision problems where the variables take discrete labels such as foreground/background. Many algorithms have been proposed to compute the joint probability of the graphical model. Belief propagation algorithms are guaranteed to converge on tree-structured graphs. For graph structures with loops, Loopy Belief Propagation (LBP) [13] has been shown empirically to perform well for discrete variables. The problem becomes non-trivial when the variables take continuous values. Nonparametric Belief Propagation (NBP) by Sudderth et al. [20] and Particle Message Passing (PAMPAS) by Isard et al. [10] provide sampling approaches to perform belief propagation with continuous variables. Both of these approaches approximate a continuous function as a mixture of weighted Gaussians and use local Gibbs sampling to approximate the product of mixtures. NBP has been used effectively in applications such as human pose estimation [17] and hand tracking [21] by modelling the graph as a tree-structured particle network. Scene understanding problems in which a scene is composed of household objects with articulations demand a large number of samples in the representation to handle the high-dimensional multimodal state space. The algorithm proposed in this paper produces promising results in handling such demands. We reported comparisons with an existing NBP algorithm [10] in [3] using 2D examples.
Model based generative methods [14, 23, 25] are increasingly being used to solve scene estimation problems where heuristics from discriminative approaches [16, 5] are used to infer object poses. These approaches do not account for object-object interactions or articulations and rely significantly on the effectiveness of recognition. Our framework does not rely on any prior detections but can benefit from them while inherently handling noisy priors [20, 10, 3]. Chua et al. [1] proposed a scene grammar representation and belief propagation over factor graphs, with an objective similar to ours: generating scenes with multiple objects that satisfy the scene grammar. Although this approach is similar to ours, we specifically deal with 3D observations along with continuous variables.
III Problem Statement
We consider an articulated object $O$ to be comprised of $N$ object-parts and $N-1$ points of articulation. Such an object description conforms with the Unified Robot Description Format (URDF) commonly used in the Robot Operating System (ROS) [15]. Such a URDF-compliant kinematic model can be represented using an undirected graph $G = (V, E)$ with nodes $V$ for object-part links and edges $E$ for points of articulation. If $G$ is a Markov Random Field (MRF), it has two types of variables $X$ and $Y$ that are, respectively, hidden and observed variables. Let $Y = \{Y_s \mid Y_s \in V\}$, where $Y_s = P_s$, with $P_s$ being the point cloud observed by the robot’s 3D sensor. Each object-part has an observed node in the graph $G$. $Y_s$ serves as a region of interest if a trained object detector is used to find the object in the scene, but is optional in our current approach. Each observed node is connected to a hidden node that represents the pose of the underlying object part. Let $X = \{X_s \mid X_s \in V\}$, where $X_s = dq_s$ is a dual quaternion pose of an object-part. Dual quaternions [4, 11] are a quaternion equivalent to dual numbers, representing a 6D pose as $dq_s = q_r + \epsilon q_d$, where $q_r$ is the real component and $q_d$ is the dual component. Alternatively it is represented as $dq_s = [q_r][q_d]$. Constructing a dual quaternion is similar to constructing rotation matrices, with a product of dual quaternions representing translation and orientation as $dq_s = dq_t * dq_r$, where $*$ is dual quaternion multiplication. $dq_r = [q_r][0, 0, 0, 0]$ is the dual quaternion representation of pure rotation and $dq_t = [1, 0, 0, 0][0, t_x/2, t_y/2, t_z/2]$ is the dual quaternion representation of pure translation. This dual quaternion representation is widely used for rigid body kinematics due to its efficiency and elegance compared with matrix multiplication. In addition to representing the hidden variable $X_s$, dual quaternions can capture the constraints in the edges and represent articulation types such as prismatic, revolute, and fixed effectively. This will be discussed in detail in Section IV-D2.
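To make the dual quaternion construction concrete, the following Python sketch composes a pure translation with a pure rotation and applies the result to a point. It is an illustration only: the (w, x, y, z) tuple layout, the helper names, and the example values are assumptions made here, not the paper's implementation.

```python
import math

def qmul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )

def qconj(q):
    w, x, y, z = q
    return (w, -x, -y, -z)

def dq_mul(a, b):
    """Dual quaternion product: (ar + e ad)(br + e bd) = ar br + e (ar bd + ad br)."""
    ar, ad = a
    br, bd = b
    real = qmul(ar, br)
    dual = tuple(p + q for p, q in zip(qmul(ar, bd), qmul(ad, br)))
    return (real, dual)

def dq_from_pose(q_r, t):
    """Compose pure translation and pure rotation as dq = dq_t * dq_r."""
    dq_r = (q_r, (0.0, 0.0, 0.0, 0.0))
    dq_t = ((1.0, 0.0, 0.0, 0.0), (0.0, t[0] / 2, t[1] / 2, t[2] / 2))
    return dq_mul(dq_t, dq_r)

def dq_translation(dq):
    """Recover the translation: t = 2 * q_d * conj(q_r) for a unit q_r."""
    real, dual = dq
    _, x, y, z = qmul(dual, qconj(real))
    return (2 * x, 2 * y, 2 * z)

def dq_transform_point(dq, p):
    """Rotate p by the real part, then add the translation."""
    real, _ = dq
    _, x, y, z = qmul(qmul(real, (0.0, *p)), qconj(real))
    t = dq_translation(dq)
    return (x + t[0], y + t[1], z + t[2])

# Hypothetical pose: 90 degree rotation about z, then translation by (1, 2, 3).
q_z90 = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
pose = dq_from_pose(q_z90, (1.0, 2.0, 3.0))
moved = dq_transform_point(pose, (1.0, 0.0, 0.0))
```

Under these assumptions, the same `dq_mul` product can chain a parent part's pose with a joint offset, which is one way an articulation constraint could be composed.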
Pose estimation of the articulated object involves inferring the hidden variables $X$ that maximize the joint probability of the graph $G$ considering only second-order cliques, which is given as:
$$p(X, Y) = \frac{1}{Z} \prod_{(s,t) \in E} \psi_{s,t}(X_s, X_t) \prod_{s \in V} \phi_s(X_s, Y_s) \tag{1}$$
where $\psi_{s,t}(X_s, X_t)$ is the pairwise potential between nodes $X_s$ and $X_t$, $\phi_s(X_s, Y_s)$ is the unary potential between the hidden node $X_s$ and observed node $Y_s$, and $Z$ is a normalizing factor. The problem is to infer belief over the possible articulation poses assigned to the continuous hidden variables $X$ such that the joint probability is maximized. This inference is generally performed by passing messages between hidden variables until their belief distributions converge over several iterations. After convergence, a maximum likelihood estimate of the marginal belief gives the pose estimate of the object-part corresponding to each node in the graph $G$. The collection of all such object-part pose estimates forms the entire object’s pose estimate.
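The factorization of the joint probability into pairwise and unary potentials can be illustrated with a toy example. The sketch below evaluates the unnormalized joint (the normalizing factor is omitted) for a small graph with scalar poses. The node names, the 1D poses, and both toy potentials are assumptions chosen for illustration and are not the potentials used in the paper.

```python
import math

def unnormalized_joint(poses, observations, edges, unary, pairwise):
    """Product of pairwise potentials over edges and unary potentials over nodes."""
    value = 1.0
    for s, t in edges:
        value *= pairwise(poses[s], poses[t])
    for s in poses:
        value *= unary(poses[s], observations[s])
    return value

# Toy 1D potentials (illustrative assumptions only).
def toy_unary(x, y):
    return math.exp(-(x - y) ** 2)

def toy_pairwise(x_s, x_t):
    # Prefers the second part to sit exactly 1 unit away from the first.
    return math.exp(-((x_t - x_s) - 1.0) ** 2)

edges = [("cabinet", "drawer1"), ("cabinet", "drawer2")]
obs = {"cabinet": 0.0, "drawer1": 1.0, "drawer2": 1.0}

consistent = {"cabinet": 0.0, "drawer1": 1.0, "drawer2": 1.0}
inconsistent = {"cabinet": 0.0, "drawer1": 3.0, "drawer2": 1.0}

good = unnormalized_joint(consistent, obs, edges, toy_unary, toy_pairwise)
bad = unnormalized_joint(inconsistent, obs, edges, toy_unary, toy_pairwise)
```

A pose assignment that violates either the observation or the articulation constraint lowers the product, which is why maximizing the joint favors configurations that satisfy both.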
IV Nonparametric Belief Propagation
IV-A Overview
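A message is denoted as $m_{ts}$, directed from node $t$ to node $s$, if there is an edge between the nodes in the graph $G$. The message represents the distribution of what node $t$ thinks node $s$ should take in terms of the hidden variable $X_s$. Typically, if $X_s$ is in the continuous domain, then $m_{ts}(X_s)$ is represented as a Gaussian mixture to approximate the real distribution:
$$m_{ts}(X_s) = \sum_{i=1}^{M} w_{ts}^{(i)} \, \mathcal{N}\!\left(X_s; \mu_{ts}^{(i)}, \Lambda_{ts}^{(i)}\right) \tag{2}$$
where $\sum_{i=1}^{M} w_{ts}^{(i)} = 1$, $M$ is the number of Gaussian components, $w_{ts}^{(i)}$ is the weight associated with the $i$-th component, and $\mu_{ts}^{(i)}$ and $\Lambda_{ts}^{(i)}$ are the mean and covariance of the $i$-th component, respectively. We use the terms components, particles and samples interchangeably in this paper. Hence, a message can be expressed as $M$ triplets:
$$m_{ts} = \left\{ \left(w_{ts}^{(i)}, \mu_{ts}^{(i)}, \Lambda_{ts}^{(i)}\right) : 1 \le i \le M \right\} \tag{3}$$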
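Assuming the graph has a tree or loopy structure, computing these message updates is computationally nontrivial. A message update in a continuous domain at iteration $n$ from node $t$ to node $s$ is given by
$$m_{ts}^{n}(X_s) \leftarrow \int_{X_t \in \mathbb{R}^d} \psi_{s,t}(X_s, X_t) \, \phi_t(X_t, Y_t) \prod_{u \in \rho(t) \setminus s} m_{ut}^{n-1}(X_t) \, dX_t \tag{4}$$
where $\rho(t)$ is the set of neighbor nodes of $t$. The marginal belief over each hidden node at iteration $n$ is given by
$$bel_s^{n}(X_s) \propto \phi_s(X_s, Y_s) \prod_{t \in \rho(s)} m_{ts}^{n}(X_s), \qquad bel_s^{n} = \left\{ \left(w_s^{(i)}, \mu_s^{(i)}, \Lambda_s^{(i)}\right) : 1 \le i \le T \right\} \tag{5}$$
where $T$ is the number of components used to represent the belief.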
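As a small illustration of the Gaussian mixture representation of a message, the sketch below evaluates a mixture of weighted components at a query value. It is restricted to a scalar variable for brevity, so each covariance reduces to a variance. The function names and the bimodal example are assumptions for illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def evaluate_message(x, components):
    """Evaluate a mixture given as (weight, mean, variance) triplets.

    The weights are expected to sum to 1.
    """
    return sum(w * gaussian_pdf(x, mu, var) for w, mu, var in components)

# Hypothetical bimodal message: two equally likely pose hypotheses.
message = [(0.5, -1.0, 0.25), (0.5, 1.0, 0.25)]

at_mode = evaluate_message(1.0, message)
between_modes = evaluate_message(0.0, message)
```

A mixture of this kind can keep several pose hypotheses alive at once, which matters when occlusion makes the observation ambiguous.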
IV-B “Push” Message Update
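NBP [20] provides a Gibbs sampling approach to compute an approximation of the product $\prod_{u \in \rho(t) \setminus s} m_{ut}^{n-1}(X_t)$. Assuming that $\phi_t(X_t, Y_t)$ is pointwise computable, a “pre-message” [9] is defined as
$$M_{ts}^{n-1}(X_t) = \phi_t(X_t, Y_t) \prod_{u \in \rho(t) \setminus s} m_{ut}^{n-1}(X_t) \tag{6}$$
which can be computed in the Gibbs sampling procedure. This reduces Equation 4 to
$$m_{ts}^{n}(X_s) \leftarrow \int_{X_t \in \mathbb{R}^d} \psi_{s,t}(X_s, X_t) \, M_{ts}^{n-1}(X_t) \, dX_t \tag{7}$$
NBP [20] samples $\hat{X}_t^{(i)}$ from the “pre-message”, followed by pairwise sampling in which $\psi_{s,t}(X_s, X_t)$ acts as the conditional $\psi_{s,t}(X_s \mid X_t = \hat{X}_t^{(i)})$ to get a sample $\hat{X}_s^{(i)}$.
The Gibbs sampling procedure is itself iterative, and hence makes the computation of the “pre-message” (as the Foundation function described for PAMPAS) expensive as $M$ increases.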
IV-C “Pull” Message Update
Given the overview of Nonparametric Belief Propagation above in Section IV-A, we now describe our “pull” message passing algorithm. We represent the message $m_{ts}$ as a set of pairs instead of the triplets in Equation 3:
$$m_{ts} = \left\{ \left(w_{ts}^{(i)}, \mu_{ts}^{(i)}\right) : 1 \le i \le M \right\} \tag{8}$$
Similarly, the marginal belief is summarized as a sample set
$$bel_s = \left\{ \mu_s^{(i)} : 1 \le i \le T \right\} \tag{9}$$
where $T$ is the number of samples representing the marginal belief. We assume that there is a marginal belief over $X_s$, denoted $bel_s^{n-1}(X_s)$, from the previous iteration. To compute $m_{ts}^{n}(X_s)$ at iteration $n$, we initially sample $\{\mu_{ts}^{(i)}\}_{i=1}^{M}$ from the belief $bel_s^{n-1}(X_s)$. These samples are passed to the neighboring nodes $\rho(s)$, where the weights $w_{ts}^{(i)}$ are computed. This step is described in IV-A. The computation of $bel_s^{n}(X_s)$ is described in IV-B. The key difference between the “push” approach of the earlier methods (NBP [20] and PAMPAS [10]) and our “pull” approach is the message generation. In the “push” approach, the incoming messages to $t$ determine the outgoing message $t \to s$. In the “pull” approach, by contrast, samples representing $t \to s$ are drawn from the belief of $s$ from the previous iteration and weighted by the incoming messages to $t$. This weighting strategy is computationally efficient. Additionally, the product of incoming messages used to compute $bel_s^{n}(X_s)$ is approximated by a resampling step as described in IV-B.
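The pull update described above can be sketched for scalar poses as follows. This is one plausible reading of the description, not the authors' implementation: samples for the message from the sending node to the receiving node are drawn from the receiving node's previous belief, then each sample is weighted by the sending node's unary potential and by the messages arriving at the sending node. All function names, the toy potentials, and the three-node chain in the usage lines are assumptions made here.

```python
import math
import random

def pull_message(belief_s, incoming_to_t, unary_t, pairwise,
                 sample_xt_given_xs, num_samples, rng):
    """Return the message t -> s as a list of (weight, sample) pairs over X_s."""
    # Pull: draw the message samples from the receiver's previous belief.
    samples = [rng.choice(belief_s) for _ in range(num_samples)]
    weights = []
    for x_s in samples:
        # Unary term: propose a pose for t compatible with x_s and score it.
        x_t = sample_xt_given_xs(x_s, rng)
        w = unary_t(x_t)
        # Neighbour term: score x_s against each message arriving at t
        # (messages from neighbours of t other than s).
        for msg in incoming_to_t:
            w *= sum(w_j * pairwise(x_s, mu_j) for w_j, mu_j in msg)
        weights.append(w)
    total = sum(weights)
    if total <= 0.0:
        weights = [1.0 / num_samples] * num_samples  # fall back to uniform
    else:
        weights = [w / total for w in weights]
    return list(zip(weights, samples))

# Toy chain u - t - s with scalar poses; t should sit 1 unit ahead of s.
def pairwise(x_s, x_t):
    return math.exp(-((x_t - x_s - 1.0) ** 2) / (2 * 0.1 ** 2))

def sample_xt_given_xs(x_s, rng):
    return x_s + 1.0 + rng.gauss(0.0, 0.1)

def unary_t(x_t):
    return math.exp(-((x_t - 3.0) ** 2) / (2 * 0.2 ** 2))  # t observed near 3

rng = random.Random(0)
belief_s = [rng.uniform(0.0, 5.0) for _ in range(500)]  # uninformed belief
message_u_to_t = [(0.5, 2.9), (0.5, 3.1)]               # samples over X_t
message = pull_message(belief_s, [message_u_to_t], unary_t, pairwise,
                       sample_xt_given_xs, 500, rng)
estimate = sum(w * x for w, x in message)
```

In this toy setting the weighted samples concentrate near the value of the receiving node that is consistent with both the observation at the sending node and the articulation constraint. Each sample is weighted independently, so no Gibbs sampling over mixture products is needed.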
IV-D Potential functions
IV-D1 Unary potential
The unary potential $\phi_t(X_t, Y_t)$ is used to model the likelihood by measuring how well a pose $X_t$ explains the point cloud observation $P_t$. The hypothesized object pose $X_t$ is used to position the given geometric object model and generate a synthetic point cloud $P_t^*$ that can be matched with the observation $P_t$. The synthetic point cloud is constructed using the object-part’s geometric model available a priori. The likelihood is calculated as
$$\phi_t(X_t, Y_t) = e^{-\lambda_r D(P_t, P_t^*)} \tag{10}$$
where $\lambda_r$ is the scaling factor and $D(P_t, P_t^*)$ is the sum of the 3D Euclidean distances between the observed point $p \in P_t$ and the rendered point $p^* \in P_t^*$ at each pixel location in the region of interest.
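A minimal sketch of a distance-based likelihood of this kind is given below. It assumes that the observed and rendered points are already paired by pixel location and that the scaling factor is positive with a negative exponent, so that larger distances give smaller likelihoods. The function name and the example values are hypothetical.

```python
import math

def unary_potential(observed, rendered, scale):
    """exp(-scale * D), where D sums Euclidean distances of paired 3D points.

    observed and rendered are equal-length sequences of (x, y, z) points,
    assumed to be paired by pixel location.
    """
    d = sum(math.dist(p, q) for p, q in zip(observed, rendered))
    return math.exp(-scale * d)

# Hypothetical observation and two rendered hypotheses.
observed = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.0), (0.0, 0.1, 1.0)]
exact = list(observed)
shifted = [(x, y, z + 0.05) for x, y, z in observed]

likelihood_exact = unary_potential(observed, exact, scale=10.0)
likelihood_shifted = unary_potential(observed, shifted, scale=10.0)
```

The scaling factor controls how sharply the likelihood falls off with rendering error. A larger value makes the potential more selective, at the cost of flattening the weights of poor hypotheses toward zero.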
IV-D2 Pairwise potential and sampling
The pairwise potential $\psi_{s,t}(X_s, X_t)$ gives information about how compatible two object poses are given the joint articulation constraints captured by the edge between them. As mentioned in Section III, these constraints are captured using dual quaternions. Most often, the joint articulation constraints have a minimum and maximum range in either prismatic or revolute types. We capture this information from the URDF to get $R_{s|t} = [dq^{a}_{s|t}, dq^{b}_{s|t}]$, giving the limits of articulation. For a given $X_s$ and $R_{s|t}$, let $P^{a}_{t|s}$ and $P^{b}_{t|s}$ denote the poses of $X_t$ at the two limits. We find the distance between $X_t$ and the limits as $A = d(X_t, P^{a}_{t|s})$ and $B = d(X_t, P^{b}_{t|s})$, as well as the distance between the limits $C = d(P^{a}_{t|s}, P^{b}_{t|s})$. Using a joint limit kernel parameterized by $\sigma$, we evaluate the pairwise potential as:
$$\psi_{s,t}(X_s, X_t) = e^{-\frac{(A + B - C)^2}{2\sigma^2}} \tag{11}$$
The pairwise sampling uses the same limits $R_{s|t}$ to sample $X_t$ given $X_s$. We uniformly sample a dual quaternion $\bar{X}_t$ that lies between $[P^{a}_{t|s}, P^{b}_{t|s}]$ and transform it back to $X_s$’s current frame of reference by $X_t = X_s * \bar{X}_t$.
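The joint-limit idea can be illustrated for a single prismatic joint with scalar displacements. The sketch below assumes a kernel that equals 1 whenever the joint value lies between the limits and falls off as a Gaussian outside them, using the distances to each limit and the distance between the limits. The kernel form, the function names, and the drawer limits are assumptions for illustration.

```python
import math
import random

def pairwise_potential(joint_value, lower, upper, sigma):
    """Joint-limit kernel: 1 inside [lower, upper], Gaussian fall-off outside."""
    a = abs(joint_value - lower)   # distance to the lower limit
    b = abs(joint_value - upper)   # distance to the upper limit
    c = abs(upper - lower)         # distance between the limits
    return math.exp(-((a + b - c) ** 2) / (2.0 * sigma ** 2))

def sample_child(parent_position, lower, upper, rng):
    """Sample a joint value uniformly within the limits and express the
    child position in the parent's frame (1D prismatic case)."""
    return parent_position + rng.uniform(lower, upper)

# Hypothetical drawer that can slide between 0.0 m and 0.4 m.
inside = pairwise_potential(0.2, 0.0, 0.4, sigma=0.05)
outside = pairwise_potential(0.5, 0.0, 0.4, sigma=0.05)

rng = random.Random(1)
child_samples = [sample_child(1.0, 0.0, 0.4, rng) for _ in range(100)]
```

Inside the limits the two distances to the limits sum to the distance between them, so the exponent vanishes and every admissible joint value is equally compatible.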
V Experiments and Results
V-A Experimental setup
We use the Fetch robot, a mobile manipulation platform, for our data collection and manipulation experiments. RGBD data is collected using an ASUS Xtion RGBD sensor mounted on the robot, along with the camera intrinsics and the camera-to-robot-base transform. We use CUDA-OpenGL interoperation to render synthetic scenes for a large set of poses in a single render buffer on a GPU. We render scenes as depth images, then project them back to 3D point clouds via the camera intrinsic parameters.
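The projection of a rendered depth image back to a 3D point cloud can be sketched with the standard pinhole camera model. The function name, the nested-list image layout, and the intrinsic values below are assumptions for illustration; the actual pipeline performs this on the GPU.

```python
def backproject(depth, fx, fy, cx, cy):
    """Convert a depth image into 3D points in the camera frame.

    depth is a list of rows of depth values in metres, with 0 meaning no
    return. fx, fy are focal lengths and cx, cy the principal point, all
    in pixels.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z > 0.0:
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points

# Hypothetical 2x2 depth image with two valid pixels.
depth_image = [
    [0.0, 2.0],
    [1.0, 0.0],
]
cloud = backproject(depth_image, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

Because the rendered and observed clouds come from the same intrinsics, points at the same pixel location can be compared directly.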
V-B Articulated Objects Models
We used a cabinet with three drawers as the articulated object in our experiments. A CAD model of the object was obtained from the Internet, and its articulations were annotated in Blender to generate a URDF model. Geometrical models and articulation models can either be crowd-sourced [6] or learned using human or robot interactions [12].
V-C Baseline
We implemented a Monte Carlo localization (particle filter) method with an object-specific state representation. For example, the cabinet with three drawers has the state representation $(x, y, z, \phi, \psi, \chi, t_a, t_b, t_c)$, where the first six elements describe the 6D pose of the object in the world and $t_a, t_b, t_c$ represent the prismatic articulations. The measurement model in the implementation uses the unary potential described in Section IV-D1. Instead of rendering a point cloud of each object-part, the entire object in the hypothesized pose is rendered to measure the likelihood. As the observations are static, the action model of the standard particle filter is replaced with a Gaussian diffusion over the object poses.
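One iteration of such a baseline can be sketched as Gaussian diffusion, likelihood weighting, and resampling. The sketch below uses a two-dimensional toy state and a toy likelihood in place of the full object state and the rendered-point-cloud measurement model. The function names and all numeric values are assumptions for illustration.

```python
import math
import random

def particle_filter_step(particles, likelihood, sigmas, rng):
    """Diffuse each particle with Gaussian noise, weight it, and resample."""
    diffused = [
        [x + rng.gauss(0.0, s) for x, s in zip(p, sigmas)] for p in particles
    ]
    weights = [likelihood(p) for p in diffused]
    if sum(weights) <= 0.0:
        weights = [1.0] * len(diffused)  # fall back to uniform resampling
    return rng.choices(diffused, weights=weights, k=len(particles))

# Toy 2D state with a hypothetical target; the paper's state has nine elements.
target = (1.0, -2.0)

def toy_likelihood(p):
    sq = sum((a - b) ** 2 for a, b in zip(p, target))
    return math.exp(-sq / (2 * 0.5 ** 2))

rng = random.Random(0)
particles = [[rng.uniform(-5, 5), rng.uniform(-5, 5)] for _ in range(300)]
for _ in range(30):
    particles = particle_filter_step(particles, toy_likelihood, (0.1, 0.1), rng)
mean = [sum(p[i] for p in particles) / len(particles) for i in range(2)]
```

Every particle here is a full joint state, so the number of particles needed to cover the space grows quickly with the number of articulated parts. This is the scaling issue that the factored formulation aims to avoid.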
V-D Convergence Results
In Figure 3, we show the convergence of the proposed method visually for two scenes containing different point cloud observations. We collected point cloud observations of the objects in arbitrary poses and performed inference using both the proposed PMPNBP and the baseline Monte Carlo localization. The entire point cloud observed by the sensor is used as the observation for all the object-parts. The first column shows the scene (RGB is not used in the inference). The second column shows the uniformly initialized poses of the object-parts over the entire point cloud. The third column shows the propagated belief particles for each object-part after 100 iterations. The fourth column shows the maximum likelihood estimate (MLE) of each object-part using the belief particles from the third column.
For the results shown in Figure 3, we ran our inference for 100 iterations with 400 particles representing the messages. Ten different runs are used to generate the convergence plot, which shows the mean and variance of the error across the runs. We adopt the average distance metric (ADD) proposed in [8, 25] for the evaluation. The point cloud model of the object-part is transformed by its ground truth dual quaternion ($dq$) and by the estimated pose’s dual quaternion ($\tilde{dq}$). The error is calculated as the pointwise distance between these transformation pairs, normalized by the number of points in the model point cloud:
$$ADD = \frac{1}{m} \sum_{p \in \mathcal{M}} \left\| (dq * p * dq^{c}) - (\tilde{dq} * p * \tilde{dq}^{c}) \right\| \tag{12}$$
where $dq^{c}$ and $\tilde{dq}^{c}$ are the conjugates of the dual quaternions [4, 11], and $m$ is the number of 3D points in the model set $\mathcal{M}$.
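The average distance computation can be sketched as follows. For brevity, each pose is represented here as a unit quaternion plus a translation vector rather than as a dual quaternion; the two encode the same rigid transform, and this substitution is an assumption of the sketch. The function names and the three-point model are hypothetical.

```python
import math

def rotate(q, p):
    """Rotate point p by unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    # t = 2 * (v x p), then p' = p + w * t + v x t, with v = (x, y, z).
    tx = 2.0 * (y * p[2] - z * p[1])
    ty = 2.0 * (z * p[0] - x * p[2])
    tz = 2.0 * (x * p[1] - y * p[0])
    return (
        p[0] + w * tx + (y * tz - z * ty),
        p[1] + w * ty + (z * tx - x * tz),
        p[2] + w * tz + (x * ty - y * tx),
    )

def transform(pose, p):
    """Apply a pose given as (quaternion, translation) to point p."""
    q, t = pose
    r = rotate(q, p)
    return (r[0] + t[0], r[1] + t[1], r[2] + t[2])

def add_error(model, pose_gt, pose_est):
    """Mean distance between model points under the two poses."""
    total = sum(
        math.dist(transform(pose_gt, p), transform(pose_est, p)) for p in model
    )
    return total / len(model)

identity_q = (1.0, 0.0, 0.0, 0.0)
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
ground_truth = (identity_q, (0.0, 0.0, 0.0))
offset_estimate = (identity_q, (0.1, 0.0, 0.0))
error = add_error(model, ground_truth, offset_estimate)
```

For a pure translation error every model point moves by the same amount, so the metric equals the length of the offset. Rotation errors contribute more for points far from the rotation axis.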
V-E Partial and incomplete observations
Articulated models suffer from self-occlusions and often from environmental occlusions. By exploiting the articulation constraints of an object in the pose estimation, our inference method is able to produce a physically plausible estimate that can explain partial or incomplete observations. In Figure 4 we show three compelling cases that indicate the strength of our inference method. In the first case, drawer 1 heavily occludes the bottom drawers, resulting in limited observations of drawers 2 and 3. PMPNBP is able to estimate a plausible pose given the constraints. In the second case, the cabinet is occluded by the robot’s arm, while in the third case, a blanket from drawer 1 occludes half of the object. PMPNBP is able to recover from these occlusions and produce a plausible estimate along with a belief over possible poses.
The factored approach proposed in this paper scales to objects with a higher number of links and joints with combinations of articulations. This is evaluated by estimating the pose of a Fetch robot that has 12 nodes and 11 edges in its graphical model. The graphical model is constructed using the URDF model of the robot. This is shown in Figure 5(c), where the robot is observed using a depth camera. Figures 5(a) and 5(b) show the original scene and its point cloud observation with partial sensor data on the base, torso and head of the robot. PMPNBP is able to estimate the pose of the robot by iteratively passing messages for 1000 iterations. Figures 5(d-f) and 5(g-i) show the belief samples of the robot links at iterations 1 and 1000, followed by the maximum likelihood estimate (MLE), from two different viewpoints for better visualization.
V-F Benefits of maintaining belief towards planning actions
We show how the belief propagation approach aids planning with a simple task illustration. Assume that the robot is performing a larger task of storing items in drawer 3. In a subtask, the goal is to open drawer 3. In this setting (see Figure 6) the robot perceives the current scene by estimating the pose of the cabinet, along with the covariance of the belief for each part. We set a maximum threshold of 0.25 cm on the per-dimension standard deviation to decide whether the estimate is certain. In this case, the standard deviation of the belief falls within this threshold and the robot is certain that drawer 1 is open and drawer 3 is closed. Hence, the robot performs the action of opening drawer 3. For the same task but with a different observation (see Figure 7), the robot estimates the pose of the cabinet, along with its covariance. However, in this case, the robot is not certain about the estimate because the standard deviation exceeds the threshold. This leads the robot to take an intermediate action (lowering its torso) that provides a new observation of the cabinet. With this new observation, the robot perceives with more certainty that drawer 3 is closed and performs an open action. This illustrates how the belief can be used in planning actions. More rigorous experiments on the choice of thresholds for different objects and tasks are left to future work.
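The thresholding rule in this illustration can be sketched as a check on the spread of the belief samples. The sketch below assumes belief samples given as tuples of position values in metres and a hypothetical threshold; the action labels and function name are also assumptions and do not come from the paper's planner.

```python
import statistics

def choose_action(belief_samples, threshold):
    """Act if every dimension's standard deviation is within the threshold,
    otherwise request an information-gathering action."""
    spread = max(statistics.pstdev(dim) for dim in zip(*belief_samples))
    action = "act" if spread <= threshold else "gather_information"
    return action, spread

# Hypothetical beliefs over a drawer's (x, y) position, in metres.
confident_belief = [(0.500, 0.0), (0.501, 0.0), (0.499, 0.0)]
uncertain_belief = [(0.3, 0.0), (0.5, 0.0), (0.7, 0.0)]

action_confident, _ = choose_action(confident_belief, threshold=0.0025)
action_uncertain, _ = choose_action(uncertain_belief, threshold=0.0025)
```

A point estimate alone would give the planner no basis for this choice. The sample spread is what lets it distinguish a reliable estimate from one that calls for a new viewpoint.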
VI Conclusion
We proposed the Pull Message Passing algorithm for Nonparametric Belief Propagation (PMPNBP), an efficient algorithm for estimating the poses of articulated objects. The problem was formulated as graph inference on a Markov Random Field (MRF). We showed that PMPNBP outperforms the baseline Monte Carlo localization method quantitatively. Qualitative results were provided to show the pose estimation accuracy of PMPNBP under a variety of occlusions. We also showed the scalability of the algorithm to articulated objects with a higher number of nodes and edges in their probabilistic graphical models. In addition, we illustrated how belief propagation can benefit robot manipulation tasks. Uncertainty in inference is inevitable in robotic perception. Our proposed PMPNBP algorithm is able to accurately estimate the pose of articulated objects and maintain a belief over possible poses that can benefit a robot in performing a task.
References
[1] J. Chua and P. F. Felzenszwalb. Scene grammars, factor graphs, and belief propagation. arXiv preprint arXiv:1606.01307, 2016.
[2] K. Desingh, O. C. Jenkins, L. Reveret, and Z. Sui. Physically plausible scene estimation for manipulation in clutter. In IEEE-RAS 16th International Conference on Humanoid Robots (Humanoids), pages 1073–1080, 2016.
[3] K. Desingh, A. Opipari, and O. C. Jenkins. Pull message passing for nonparametric belief propagation. CoRR, abs/1807.10487, 2018.
[4] I. Gilitschenski, G. Kurz, S. J. Julier, and U. D. Hanebeck. A new probability distribution for simultaneous representation of uncertain position and orientation. In 17th International Conference on Information Fusion (FUSION), pages 1–7. IEEE, 2014.
[5] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
[6] S. R. Gouravajhala, J. Yim, K. Desingh, Y. Huang, O. C. Jenkins, and W. S. Lasecki. Eureca: Enhanced understanding of real environments via crowd assistance. 2018.
[7] K. Hausman, S. Niekum, S. Osentoski, and G. S. Sukhatme. Active articulation model estimation through interactive perception. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 3305–3312, May 2015.
[8] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab. Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In Asian Conference on Computer Vision, pages 548–562. Springer, 2012.
