Real-time Background-aware 3D Textureless Object Pose Estimation
Mang Shao, Danhang Tang, Tae-Kyun Kim

TL;DR
This paper introduces a real-time 3D object pose estimation method using a modified fuzzy decision forest with background rejection, achieving high efficiency and scalability while maintaining accuracy.
Contribution
It proposes a novel background rejector node in the fuzzy decision forest for faster, scalable 3D pose estimation from templates.
Findings
Outperforms state-of-the-art in efficiency
Maintains comparable accuracy
Scales well to large datasets
Abstract
In this work, we present a modified fuzzy decision forest for real-time 3D object pose estimation based on a typical template representation. We employ an extra preemptive background rejector node in the decision forest framework to terminate the examination of background locations as early as possible, resulting in a significant improvement in efficiency. Our approach is also scalable to large datasets, since the tree structure naturally provides logarithmic time complexity in the number of objects. Finally, we further reduce the cost of the validation stage with a fast breadth-first scheme. The results show that our approach outperforms the state of the art in efficiency while maintaining comparable accuracy.
Table 1: Per-object runtime and accuracy with 1 tree and 5 trees.
| Object | T_Tree (1 tree) | T_Valid (1 tree) | T_Total (1 tree) | Acc. (1 tree) | T_Tree (5 trees) | T_Valid (5 trees) | T_Total (5 trees) | Acc. (5 trees) |
|---|---|---|---|---|---|---|---|---|
| ape | 0.20 ms | 6.50 ms | 6.70 ms | 96.0% | 0.99 ms | 12.31 ms | 13.30 ms | 97.1% |
| bvise | 0.43 ms | 13.37 ms | 13.80 ms | 91.1% | 2.13 ms | 29.50 ms | 31.63 ms | 93.2% |
| cam | 0.41 ms | 11.70 ms | 12.11 ms | 93.1% | 1.93 ms | 24.66 ms | 26.59 ms | 94.8% |
| can | 0.44 ms | 13.07 ms | 13.51 ms | 91.5% | 2.22 ms | 28.45 ms | 30.67 ms | 92.0% |
| cat | 0.23 ms | 8.34 ms | 8.57 ms | 94.3% | 1.05 ms | 15.23 ms | 16.28 ms | 95.5% |
| driller | 0.39 ms | 13.93 ms | 14.32 ms | 95.4% | 1.73 ms | 29.01 ms | 30.74 ms | 96.0% |
| duck | 0.28 ms | 9.64 ms | 9.92 ms | 90.0% | 1.20 ms | 15.99 ms | 17.19 ms | 94.5% |
| eggbox | 0.30 ms | 10.73 ms | 11.03 ms | 98.3% | 1.18 ms | 20.13 ms | 21.31 ms | 98.9% |
| glue | 0.34 ms | 11.46 ms | 11.80 ms | 92.1% | 1.62 ms | 29.85 ms | 31.47 ms | 94.4% |
| hpunch | 0.44 ms | 14.76 ms | 15.20 ms | 90.7% | 2.19 ms | 33.12 ms | 35.31 ms | 93.6% |
| iron | 0.39 ms | 11.82 ms | 12.21 ms | 91.9% | 1.67 ms | 19.38 ms | 21.05 ms | 92.7% |
| phone | 0.40 ms | 13.93 ms | 14.33 ms | 89.8% | 1.97 ms | 29.44 ms | 31.41 ms | 91.0% |
| Average | 0.35 ms | 11.60 ms | 11.96 ms | 92.9% | 1.66 ms | 23.92 ms | 25.58 ms | 94.4% |

| Baseline | T_Total | Acc. |
|---|---|---|
| Hashmod [11] | 83 ms | 96.5% |
| DTT-3D [15] | 55 ms | 97.2% |
| LineMOD [9] | 119 ms | 96.6% |
Table 2: Per-object total runtime and accuracy with 5 objects and 15 objects.
| Object | T_Total (5 objects) | Acc. (5 objects) | T_Total (15 objects) | Acc. (15 objects) |
|---|---|---|---|---|
| ape | 20.01 ms | 97.3% | 25.53 ms | 97.4% |
| bvise | 50.30 ms | 94.4% | 60.11 ms | 93.4% |
| cam | 59.81 ms | 94.7% | 64.47 ms | 94.0% |
| can | 53.54 ms | 93.3% | 66.98 ms | 93.1% |
| cat | 34.09 ms | 95.2% | 45.22 ms | 96.0% |
| driller | 66.40 ms | 96.4% | 78.94 ms | 95.8% |
| duck | 29.19 ms | 95.5% | 42.23 ms | 96.2% |
| eggbox | 34.50 ms | 98.8% | 47.98 ms | 98.6% |
| glue | 52.19 ms | 94.7% | 65.99 ms | 94.6% |
| hpunch | 56.32 ms | 95.2% | 74.56 ms | 95.3% |
| iron | 32.13 ms | 93.2% | 44.72 ms | 93.6% |
| phone | 54.71 ms | 93.3% | 56.10 ms | 93.3% |
| Average | 45.27 ms | 95.2% | 56.07 ms | 95.1% |
| Hashmod [11] | 131 ms | 95.5% | 184 ms | 95.1% |
| DTT-3D [15] | 107 ms | 97.2% | 239 ms | 97.2% |
| LineMOD [9] | 427 ms | 96.6% | 1197 ms | 96.6% |
Real-time Background-aware 3D Textureless Object Pose Estimation
Mang Shao
Imperial College London
Danhang Tang
Imperial College London
Tae-Kyun Kim
Imperial College London
1 Introduction
In this paper, we focus on real-time 3D rigid object pose estimation with RGB-D data. Since a rigid object pose has 6 degrees of freedom (x, y, z, pitch, yaw, roll), this problem is also known as 6-DoF pose estimation. A typical way of estimating 6-DoF object poses is template matching, which requires a mesh model that is usually obtained by scanning the target object. A large set of labelled training ‘templates’ that uniformly covers the pose space is then generated by rendering the mesh model. During testing, a sufficiently similar template is found via a distance-based search process, often with approximate nearest neighbour (ANN) techniques.
There are three main ways of performing ANN search for template matching: exhaustive, hashing-based and tree-based. Given the feature descriptor, an exhaustive method (e.g., LineMOD [8]) is guaranteed to find the most similar match, but its linear complexity is far from ideal. Hashing-based methods have sublinear or even constant search complexity, but designing an efficient hash function with a good trade-off between memory consumption and matching performance is not trivial. Tree-based methods for the ANN problem, e.g., the k-d tree, also lower the complexity significantly. However, the cascaded nature of the tree structure makes them less robust to noise, and their efficiency suffers from the curse of dimensionality due to backtracking.
2 Related work
In this section we categorise existing 6-DoF pose estimation methods into distance-based, learning-based and registration-based methods, and then discuss strategies for accelerating the template matching procedure.
2.1 Distance-based
Distance-based methods approach this problem by defining a distance metric to measure the similarity between samples. Then a set of samples from different view points, usually rendered from 3D object models, is generated as the training set. During testing, the object pose on a query image is retrieved by pairwise comparison between extracted templates and the training set.
To effectively measure the similarity between object views, a compact and discriminative description vector is required. Hinterstoisser et al. [9] present LineMOD, a novel image representation in the form of a rigid template that uses colour gradients and surface normals as feature descriptors. The templates are synthetically rendered from 3D object mesh models under different scales and view angles. As in other traditional template matching approaches, each template is matched against all possible locations across the query image to produce a similarity score map. Despite the exhaustive search, it achieves real-time speed for single object pose estimation.
Tree-based approaches apply binary search in a multidimensional space. Given a query point and a set of data points, this approach partitions the search space roughly in half at each iteration, until only one data point is left in the search space. The complexity is therefore $O(\log n)$ in the number of data points $n$. However, k-d trees are generally considered unsuitable for searching high-dimensional spaces, as most of the points in the tree will be evaluated and the efficiency is no better than exhaustive search [5]. One improvement by Beis and Lowe [1], called the best-bin-first algorithm, uses a backtracking strategy that prioritises the search queue based on closeness and achieves a speed-up of two orders of magnitude. Another solution builds multiple randomised trees to improve the search speed, at the cost of an individual k-d tree not always returning the exact nearest neighbours.
A hash table is a well-known data structure that allows a symbol lookup in $O(1)$ complexity. In other words, the search time is constant regardless of the database size. However, a hash table can only find exact matches, whereas in the ANN search problem we seek approximate matches. The most straightforward solution is to hash the whole quantised feature space into a single hash table, so that every possible query point maps directly to its nearest data points. Unfortunately, this naive approach is not feasible for high-dimensional data. A recent work by Kehl et al. [11] explores different hashing key learning strategies to achieve sublinear complexity in the number of templates, and outperforms the state of the art in terms of runtime.
2.2 Learning-based
Learning-based methods focus on better generalisation to slight variations in viewpoint, translation and local shape. The explicit background/foreground separation is learnt parametrically to deal with heavy background clutter. Results show that these approaches produce fewer false positives than nearest neighbour approaches. However, their efficacy depends on the quality of the negative training samples, and this benefit may not transfer across different domains. Tejani et al. [16] propose to incorporate a one-class learning scheme into the Hough forest framework for 6-DoF problems. Rios-Cabrera and Tuytelaars [15] extend LineMOD by learning the templates in a discriminative fashion and handle 10-30 3D objects at frame rates above 10 fps using a single CPU core.
2.3 Registration-based
Registration-based methods attempt to fit a pose hypothesis to the observation by iteratively updating the hypothesis to minimise the discrepancy between the query sample and a sample rendered from the current pose hypothesis. A popular choice is Iterative Closest Point (ICP) [4].
Johnson and Hebert present an early seminal work [10] for simultaneous recognition of multiple objects in scenes containing clutter and occlusion, which matches surfaces by matching points using the spin image representation. Gordon and Lowe [6] present a feature-based object pose estimation framework that accurately tracks the camera using learned models and SIFT features [12]. The estimation is performed by matching query image features with 3D object model features and solving the Perspective-n-Point (PnP) problem for the 2D-to-3D correspondences. Drost et al. [3] propose a novel method that creates a global model description based on oriented point pair features and matches that model locally using a fast voting scheme. A more recent work [7] improves the framework by introducing a novel matching scheme.
2.4 Accelerating Template Matching
The first difficulty in accelerating template matching with efficient search schemes comes from the high dimensionality. A single coordinate is far from representative enough to quickly reject candidates, which leads to suboptimal performance. Fortunately, many recent works have shown that, with a well-chosen feature descriptor, a few coordinates have enough contrast to reliably find the match. LineMOD [8] achieves good performance by extracting only the best 100 dimensions out of ten thousand from each individual template to perform an optimised exhaustive search. However, it is not trivial to apply an efficient search algorithm on top of this approach: the best 100 dimensions of different templates lie in different subspaces, so the distance between them is not meaningful, and ten thousand dimensions are simply too many for most ANN algorithms.
One feasible solution is to cluster the templates into a few sets, as proposed in a few recent works. Hashmod [11] clusters the templates with a randomised forest and employs hashing techniques; Discriminatively Trained Templates (DTT) [15] clusters the templates with a bottom-up clustering method and constructs strong classifiers using AdaBoost. The underlying reason is that the templates in a clustered subset share common ‘relevant’ feature dimensions; that is to say, the templates in a subset can be classified well using far fewer coordinates.
Another difficulty is the heavy noise caused by background clutter, occlusion and other environmental nuisances. Since the features in a template are generally local to a small region, it is very likely that the noise renders some feature dimensions completely irrelevant to the ground truth. It is therefore necessary to examine multiple feature dimensions to increase the signal-to-noise ratio and thus obtain reliable matching results.
For the same reason, a final validation stage is unavoidable if a good precision-recall rate is to be achieved. In this stage, a full similarity measure is calculated between the testing image region and a small subset of templates. Compared to the previous stages, validation is expensive and is usually the bottleneck of the whole method, because many more feature dimensions are involved in the calculation. Therefore, to reduce the computational cost, a good trade-off needs to be made between the size of the validation subset and the matching accuracy.
To sum up, a good approach to accelerating template matching should: (i) reduce irrelevant feature dimensions; (ii) test on multiple dimensions simultaneously to be less prone to outliers; (iii) reduce the validation subset as much as possible while maintaining matching robustness; and (iv) optimise the validation process for a further speed-up. With these in mind, we propose our tree-based method to address these problems for efficient template matching.
3 Method
The decision tree is one of the most commonly known models in machine learning for sublinear nearest neighbour search. However, training a tree-based classifier on templates directly is problematic due to insufficient training samples and noisy feature dimensions. In recent works on 3D object pose estimation using tree-based classifiers, both [16] and [2] use local feature-based representations instead of holistic templates to alleviate the overfitting issue. However, this approach requires an additional geometric verification stage, Hough voting in [16] and RANSAC-based optimisation in [2], which is likely to make the process too slow to meet the requirements of real-time applications.
In this work, we propose several extensions to the classic random forest framework that significantly accelerate template matching with only a marginal loss in accuracy compared to exhaustive search.
3.1 Template Matching in Random Forest Framework
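The template matching approach describes each view of the object instance as a single template representation $\mathcal{T}$. The template is defined as $\mathcal{T} = (\mathcal{O}, \mathcal{P})$, where $\mathcal{O}$ is a reference RGB-D image of an object and $\mathcal{P}$ denotes the set of locations $r$ in $\mathcal{O}$. The similarity measure between a template $\mathcal{T}$ and an input image $\mathcal{I}$ shifted by $c$ can be formalised as:
$$\mathcal{E}(\mathcal{I}, \mathcal{T}, c) = \sum_{r \in \mathcal{P}} d\big(f(\mathcal{O}, r),\, f(\mathcal{I}, c + r)\big)$$
where $f$ denotes the local feature descriptor and $d$ denotes the distance function. Thus, the overall similarity is the sum of the differences between all corresponding local features.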
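As a minimal sketch of the similarity measure described above, the following assumes quantised integer features stored in dictionaries keyed by pixel location and a simple 0/1 mismatch as the distance function. The names `mismatch` and `template_distance`, the dictionary layout and the choice of distance are illustrative assumptions, not the paper's implementation.

```python
from typing import Dict, Tuple

Location = Tuple[int, int]


def mismatch(a: int, b: int) -> int:
    """Hypothetical distance function: 0 if the quantised features agree, else 1."""
    return 0 if a == b else 1


def template_distance(template: Dict[Location, int],
                      image: Dict[Location, int],
                      shift: Location) -> int:
    """Sum the local feature distances over all template locations, shifted by `shift`."""
    total = 0
    for (x, y), feature in template.items():
        # Assumption: a location with no image feature counts as value 0.
        query = image.get((x + shift[0], y + shift[1]), 0)
        total += mismatch(feature, query)
    return total


template = {(0, 0): 3, (1, 0): 5, (0, 1): 7}
image = {(2, 2): 3, (3, 2): 5, (2, 3): 1}
print(template_distance(template, image, (2, 2)))
```

A lower value means a better match; sliding `shift` over the image gives the score map that an exhaustive matcher would evaluate in full.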
Given an input RGB-D image, we use a randomised decision forest to classify sliding windows centred at each pixel location. The leaf node that a window reaches in each tree retrieves a set of templates, and a final classification is then produced by a full validation against them.
Training. Similar to most recent approaches to the 6-DoF textureless object pose estimation problem, we synthetically generate our template dataset from 3D object CAD models. The templates are computed from rendered object views on upper hemispheres of several radii, using the same sampling strategy as [9], as shown in Figure 1. Each template is assigned a 2-tuple label that denotes the object class and the object pose (yaw, roll and pitch angles) respectively.
We first expand the template into a -dimensional descriptor: . In practice, we use LineMOD as our descriptor including an additional object hue map as described in [9]:
[TABLE]
where CG, SN and Hue represent the three modalities used in LineMOD: colour gradient, surface normal and hue colour respectively. Each template is therefore a long vector of integers, with one entry per location and per modality in the list of modalities used, each entry being a quantised feature value from 0 to 8. The value 0 appears when the feature is not significant, e.g. an image colour gradient whose norm is below a certain threshold. See [9] for the details of each modality. For features outside the object mask, we use uniform noise to model the background. Different background models are evaluated in [2].
Split Function. A template descriptor set $\mathcal{S}_j$, arriving at the $j$th node, is partitioned into two subsets by a split function $h$:
$$\mathcal{S}_j^{L} = \{ v \in \mathcal{S}_j \mid h(v, \theta_j) = 0 \}, \qquad \mathcal{S}_j^{R} = \{ v \in \mathcal{S}_j \mid h(v, \theta_j) = 1 \}$$
where $\theta_j$ denotes the split node parameters. The split node parameters can be written as $\theta = (\phi, \psi, \tau)$, where the feature selector function $\phi$ selects a small subspace of the entire feature space, $\psi$ defines the geometric primitive used to separate the data, and $\tau$ denotes the thresholds in the binary test. The parameters are chosen to maximise an energy function, usually the information gain, to ensure an optimal split. In practice, the design of the split function is crucial to achieving good performance. In a later section we discuss the impact of different split functions on template matching performance.
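The partitioning step can be sketched as follows, assuming a feature selector that picks a few descriptor dimensions, a geometric primitive that simply sums them, and a single threshold. All three choices, and the name `split_node`, are placeholders for illustration rather than the split function used in the paper.

```python
from typing import List, Sequence, Tuple

Descriptor = Sequence[int]


def split_node(descriptors: List[Descriptor],
               dims: Sequence[int],
               threshold: float) -> Tuple[List[Descriptor], List[Descriptor]]:
    """Partition descriptors into (left, right) with a binary test on selected dimensions."""
    left: List[Descriptor] = []
    right: List[Descriptor] = []
    for v in descriptors:
        # Feature selector: keep only `dims`. Primitive (assumed): their sum.
        response = sum(v[d] for d in dims)
        if response < threshold:
            left.append(v)
        else:
            right.append(v)
    return left, right


left, right = split_node([[1, 2, 3], [4, 5, 6], [0, 0, 8]], dims=(0, 1), threshold=4)
print(left, right)
```

Training would evaluate several candidate `(dims, threshold)` pairs at each node and keep the one with the highest energy.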
Leaf Validation. The training templates are recursively split until the stopping criteria are met. These criteria control the shape and depth of the tree, and thus the trade-off between generalisation power and efficiency. As briefly explained in the previous section, full validation on a template set is expensive and is almost always the bottleneck of the whole pipeline, as in many recent related works. Ideally, we want to keep the number of templates in each leaf node as small as possible while avoiding overfitting.
In practice, when we apply tree-based search to standard template matching, the lack of training data and the high-dimensional features are likely to lead to overfitting no matter how we set the stopping criteria. One possible workaround is a variant of the k-d tree approach (best-bin-first) [1], which backtracks from the leaf node according to a priority queue based on the closeness between the query and the bin boundary, until a fixed number of nearest candidates has been searched. However, this method is less efficient when large outliers are present, as the closeness is no longer reliable. Also, the optimal number of nearest candidates from backtracking varies with the object class and can only be decided empirically. We address this issue later in our method.
3.2 Split Function for Insufficient, High-dimensional Noisy Data
In many applications, a feature selector that maps the descriptor to a low-dimensional subspace, often of only one or two dimensions, is sufficient. However, more dimensions are needed here to compensate for the less distinctive, heavily quantised features and the higher outlier ratio.
Figure 2 shows an overview of our method. To avoid the superlinear time cost from randomised node optimisation due to high dimensionality, we randomly draw an exemplar from the template set , such that:
[TABLE]
which maximises the energy function :
[TABLE]
where denotes the energy function, denotes a randomly generated set from the entire parameter space. Since the space is greatly reduced with the exemplar approach, the size of should be limited to a small number to maintain the efficiency.
For the choice of energy function, we modify the standard unsupervised entropy to cope with the missing feature values:
[TABLE]
where denotes a foreground boolean indicator such that if located on the object mask on the template and vice versa; denotes an entropy function. In template matching or NN problem in general, each data point is assigned to a unique label. Therefore the standard entropy function is not suitable here. Instead we use an unsupervised variant:
[TABLE]
where is the feature space, i.e. = {1,2,…8} in the case of LINE-MOD descriptor. The entropy measures the uncertainty associated with the feature values given the feature dimension, a higher entropy yields better separation.
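To make the unsupervised entropy concrete, the sketch below computes the Shannon entropy of the quantised feature values observed at one feature dimension across a set of templates, skipping entries that are not on the object mask. The masking rule and the function name `feature_entropy` are assumptions for illustration; how the paper combines this quantity across dimensions is not shown here.

```python
import math
from collections import Counter
from typing import Sequence


def feature_entropy(values: Sequence[int], foreground: Sequence[bool]) -> float:
    """Shannon entropy (in bits) of the foreground feature values at one dimension."""
    kept = [v for v, is_fg in zip(values, foreground) if is_fg]
    if not kept:
        return 0.0  # no foreground evidence at this dimension
    counts = Counter(kept)
    n = len(kept)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


print(feature_entropy([1, 1, 2, 2], [True, True, True, True]))
print(feature_entropy([3, 3, 3, 3], [True, True, True, True]))
```

A dimension on which the templates disagree evenly has high entropy and separates them well, whereas a dimension on which they all agree has zero entropy.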
Next, we apply a simple fuzzy rule to the thresholding to deal with insufficient data. This approach has been proposed in the literature [14, 17] but has not drawn much attention in the field of computer vision. This adaptation tolerates imprecise or missing feature values and reduces the classification ambiguity of the split function, which is achieved by duplicating the feature vectors that lie near the split boundary.
We modify the binary test in equations 5 and 8 such that:
[TABLE]
thus feature vectors that fall into the ‘fuzzy’ interval will be passed to both child nodes. This approach allows feature vectors to reach multiple leaves, which greatly reduces the overfitting due to lack of training data.
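A fuzzy split of this kind can be sketched as below, assuming a test on a single dimension with a symmetric margin around the threshold. The margin parameter and the name `fuzzy_split` are hypothetical, and the paper's exact fuzzy interval may be defined differently.

```python
from typing import List, Sequence, Tuple

Descriptor = Sequence[float]


def fuzzy_split(descriptors: List[Descriptor],
                dim: int,
                threshold: float,
                margin: float) -> Tuple[List[Descriptor], List[Descriptor]]:
    """Split on one dimension; vectors inside the fuzzy interval go to both children."""
    left: List[Descriptor] = []
    right: List[Descriptor] = []
    for v in descriptors:
        if v[dim] <= threshold + margin:
            left.append(v)
        if v[dim] > threshold - margin:
            right.append(v)
    return left, right


left, right = fuzzy_split([[1], [4], [5], [8]], dim=0, threshold=4.5, margin=1.0)
print(left, right)
```

With `margin=0` this reduces to an ordinary binary split; a larger margin duplicates more templates, which trades larger leaves for fewer missed matches.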
3.3 Preemptive Background Rejector
A fast, coarse estimation of objectness is common in many detection methods, since the object of interest generally occupies only a small portion of the testing image. We further propose a preemptive background rejector as an extra split function in each node, which sends the query to a ‘background’ leaf node if it fails a binary test. In contrast to most background removal methods, our approach does not exploit negative samples. Instead, we assume that all feature vectors that do not exist in the dataset are negative samples. Here we isolate the foreground from the background by minimising the entropy in the rejector function, so that all foreground feature vectors share similar values:
[TABLE]
which denotes a threshold in to control the acceptance of outlier ratio; denotes a background feature look-up table, such that:
[TABLE]
The query is rejected immediately at a node if .
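The idea of rejecting a query without negative samples can be sketched as follows. This version assumes that the look-up table stores, for a few selected dimensions, the set of feature values seen among the foreground templates at the node, and that a query is rejected when the fraction of selected dimensions with an unseen value exceeds an outlier-ratio threshold. The table layout, the threshold semantics and the names `build_lookup` and `is_background` are assumptions, not the paper's definition.

```python
from typing import Dict, List, Sequence, Set


def build_lookup(templates: List[Sequence[int]],
                 dims: Sequence[int]) -> Dict[int, Set[int]]:
    """For each selected dimension, record the feature values seen in the templates."""
    return {d: {t[d] for t in templates} for d in dims}


def is_background(query: Sequence[int],
                  lookup: Dict[int, Set[int]],
                  max_outlier_ratio: float) -> bool:
    """Reject the query if too many selected dimensions hold values never seen in training."""
    if not lookup:
        return False  # nothing to test against
    misses = sum(1 for d, seen in lookup.items() if query[d] not in seen)
    return misses / len(lookup) > max_outlier_ratio


lookup = build_lookup([[1, 2, 3, 4], [1, 2, 5, 4]], dims=[0, 1, 3])
print(is_background([1, 2, 7, 4], lookup, 0.34))
print(is_background([6, 7, 3, 8], lookup, 0.34))
```

Dimensions on which the foreground templates agree (low entropy) make the most selective rejectors, since almost any background value will be unseen.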
3.4 Fast Breadth-first Leaf Validation
The leaf nodes contain only tens or at most a hundred templates; however, pairwise matching against all candidates is still computationally expensive. In practice, most bad candidates can be safely removed by examining only a small portion of the whole feature descriptor. Therefore, we propose a further speed-up of the validation process with a breadth-first preemption scheme inspired by preemptive RANSAC [13].
As shown in Figure 3, during leaf validation we split the template feature descriptors equally into smaller chunks, each containing a fixed number of features. In each stage, we score and compare all chunks and keep only the candidates that satisfy the pass condition. Scores are accumulated into the next stage, and the process repeats until only one or no candidate is left.
In our real-time implementation, we use the pass condition
[TABLE]
where s(v) is a scoring function that measures the distance between query and candidate, is a constant threshold. We set the threshold empirically. With this approach, the validation time complexity is reduced to . The chunk size works as a trade-off between accuracy and speed: larger chunk leads to a better robustness but less efficiency, and vice versa. Furthermore, we add depth value as another modality in the validation stage to measure shape similarity.
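The breadth-first scheme can be sketched as below. The pass condition used here (accumulated mismatch count no larger than `tau` times the number of features examined so far) and the 0/1 mismatch score are assumptions chosen for illustration; the names `chunk_distance` and `validate_breadth_first` are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def chunk_distance(query: Sequence[int], candidate: Sequence[int],
                   start: int, stop: int) -> int:
    """Number of mismatching features in descriptor positions [start, stop)."""
    return sum(1 for i in range(start, stop) if query[i] != candidate[i])


def validate_breadth_first(query: Sequence[int],
                           candidates: List[Sequence[int]],
                           chunk_size: int,
                           tau: float) -> Optional[int]:
    """Score all candidates chunk by chunk, pruning after each stage.

    Returns the index of the best surviving candidate, or None if all are pruned.
    """
    alive: Dict[int, int] = {i: 0 for i in range(len(candidates))}
    for start in range(0, len(query), chunk_size):
        stop = min(start + chunk_size, len(query))
        for i in list(alive):
            alive[i] += chunk_distance(query, candidates[i], start, stop)
        # Assumed pass condition: accumulated distance within tau per feature seen.
        alive = {i: s for i, s in alive.items() if s <= tau * stop}
        if len(alive) <= 1:
            break  # one or no candidate left
    if not alive:
        return None
    return min(alive, key=lambda i: alive[i])


query = [1, 2, 3, 4, 5, 6]
good = [1, 2, 3, 4, 5, 6]
partial = [1, 2, 0, 0, 5, 6]
bad = [9, 9, 9, 9, 9, 9]
print(validate_breadth_first(query, [good, partial, bad], chunk_size=2, tau=0.25))
```

In this example `bad` is dropped after the first chunk and `partial` after the second, so the last chunk is never scored for either of them.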
3.5 Evaluation
Experiments are conducted on the LineMOD ACCV12 dataset [9], which consists of 13 object CAD models and testing images for object detection and 6D pose estimation. We apply the same evaluation criteria with a distance factor of . We run the experiments on a single 2.8 GHz Intel Core i7; the whole forest takes around 10 MB. A pyramid scheme is applied in a similar way to [9].
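For reference, the average-distance criterion commonly used with this dataset accepts a pose when the mean distance between model points transformed by the ground-truth pose and by the estimated pose is at most a distance factor times the model diameter. The sketch below assumes that criterion; the factor is left as the parameter `k_m`, the value in the example is purely illustrative, and the handling of symmetric objects is omitted.

```python
import math
from typing import List, Sequence

Vec = Sequence[float]
Mat = Sequence[Sequence[float]]


def transform(R: Mat, t: Vec, x: Vec) -> List[float]:
    """Apply the rigid transform R x + t to a 3D point."""
    return [sum(R[i][j] * x[j] for j in range(3)) + t[i] for i in range(3)]


def average_distance(points: Sequence[Vec], R_gt: Mat, t_gt: Vec,
                     R_est: Mat, t_est: Vec) -> float:
    """Mean distance between model points under the two poses."""
    total = 0.0
    for x in points:
        total += math.dist(transform(R_gt, t_gt, x), transform(R_est, t_est, x))
    return total / len(points)


def pose_is_correct(points: Sequence[Vec], diameter: float, k_m: float,
                    R_gt: Mat, t_gt: Vec, R_est: Mat, t_est: Vec) -> bool:
    """Accept the estimate if the average distance is within k_m * diameter."""
    return average_distance(points, R_gt, t_gt, R_est, t_est) <= k_m * diameter


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
model = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# k_m=0.1 and diameter=1.0 are illustrative values only.
print(pose_is_correct(model, 1.0, 0.1, identity, [0.0, 0.0, 0.0], identity, [0.01, 0.0, 0.0]))
```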
Overall, our pipeline achieves sublinear time complexity and comparably high accuracy, as shown in Table 1. Although the detection rate of our approach is marginally worse than the state of the art, we are at least two times faster than the fastest competitor, DTT-3D [15]. Since our approach is tree-based, it also scales to more objects. Table 2 shows that we significantly outperform state-of-the-art approaches in speed as the number of objects grows. Additionally, both tables show that our approach works favourably on simpler objects, because the training templates share more similar features, so the background is more likely to be rejected before the expensive validation stage.
In Figure 4, we illustrate the effectiveness of our proposed preemptive background rejector. Depending on the object complexity and the scene, between 90% and 97% of background locations can be filtered out before the validation stage with a very high recall rate. Since textureless objects are generally simple in shape and colour, their rendered templates are likely to share particular features that can easily rule out background clutter.
Since most candidate locations are rejected at the first node in the decision tree, we can see from Table 1 that the time spent testing the tree itself (T_Tree) is negligible compared to the validation time (T_Valid). The performance is further boosted by using multiple trees with random permutation. Here our proposed approach shows another advantage of the random forest framework. The time cost of additional trees grows sublinearly, as the leaf nodes from each tree are concatenated and duplicates are removed before entering the validation stage. The results show that the accuracy increases by 1.5 percentage points with 5 trees, while the runtime is only approximately 2 times slower.
With the fast breadth-first leaf validation, the time taken is reduced by a factor of up to 4 or 5 without loss of accuracy. Note that in practice, due to the overhead, this approach does not work well if the set of templates is too small. In our case, we set the maximum tree depth to 8, and to 9 for multiple objects.
Finally, our approach achieves sublinear complexity as a result of the tree structure and significantly outperforms all state-of-the-art approaches in speed.
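The merging of leaf candidates across trees can be sketched as an order-preserving union of template indices. The name `merge_candidates` and the use of integer template IDs are assumptions for illustration.

```python
from typing import Iterable, List


def merge_candidates(leaves_per_tree: Iterable[Iterable[int]]) -> List[int]:
    """Concatenate the template IDs retrieved by each tree, dropping duplicates."""
    seen = set()
    merged: List[int] = []
    for leaf in leaves_per_tree:
        for template_id in leaf:
            if template_id not in seen:
                seen.add(template_id)
                merged.append(template_id)
    return merged


print(merge_candidates([[3, 7, 9], [7, 2], [9, 3, 5]]))
```

Because trees tend to retrieve overlapping templates for the same window, the merged list grows more slowly than the number of trees, which is consistent with the sublinear growth in validation time.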
3.6 Conclusion
We present an efficient and scalable approach for 3D object detection and pose estimation that modifies the randomised forest framework to cope with background noise. The results show that we significantly outperform state-of-the-art methods in terms of speed while maintaining reasonable recognition accuracy. This approach can be generalised to any machine learning task that has a very low positive rate. For example, in a typical face (or object) detection problem, on average only 0.01% of all sub-windows are positive. This assumption holds in most real-world applications.
Moreover, this approach is not limited to the random forest framework; it has the potential to be applied to any classifier based on a directed acyclic graph (DAG), such as a deep convolutional neural network (CNN). It is widely known that typical CNN forward propagation is computationally demanding and requires specialised hardware (GPUs) with high power consumption, often around 250 W per card. This motivates adding an early termination operator to the CNN framework.
- [1] J. S. Beis and D. G. Lowe. Shape indexing using approximate nearest-neighbour search in high-dimensional spaces. In Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society Conference on, pages 1000–1006. IEEE, 1997.
- [2] E. Brachmann, A. Krull, F. Michel, S. Gumhold, J. Shotton, and C. Rother. Learning 6d object pose estimation using 3d object coordinates. In European Conference on Computer Vision, pages 536–551. Springer, 2014.
- [3] B. Drost, M. Ulrich, N. Navab, and S. Ilic. Model globally, match locally: Efficient and robust 3d object recognition. In CVPR, volume 1, page 5, 2010.
- [4] A. W. Fitzgibbon. Robust registration of 2d and 3d point sets. Image and Vision Computing, 21(13):1145–1153, 2003.
- [5] J. E. Goodman, J. O’Rourke, and K. H. Rosen. Handbook of discrete and computational geometry. CRC Press LLC, 2000.
- [6] I. Gordon and D. G. Lowe. What and where: 3d object recognition with accurate pose. In Toward category-level object recognition, pages 67–82. Springer, 2006.
- [7] Q. Hao, R. Cai, Z. Li, L. Zhang, Y. Pang, F. Wu, and Y. Rui. Efficient 2d-to-3d correspondence filtering for scalable 3d object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 899–906, 2013.
- [8] S. Hinterstoisser, C. Cagniart, S. Ilic, P. Sturm, N. Navab, P. Fua, and V. Lepetit. Gradient response maps for real-time detection of textureless objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(5):876–888, 2012.
