Incremental Class Discovery for Semantic Segmentation with RGBD Sensing
Yoshikatsu Nakajima, Byeongkeun Kang, Hideo Saito, Kris Kitani

TL;DR
This paper introduces an incremental open-world semantic segmentation method using RGBD data that discovers new object classes over time by building and analyzing dense 3D maps, enabling semi-real-time performance.
Contribution
A novel approach that incrementally learns new classes in semantic segmentation by leveraging 3D map regions, reducing computational complexity and enabling open-world object discovery.
Findings
Successfully clusters known and unseen object classes.
Achieves semi-real-time processing at 10.7Hz.
Outperforms some state-of-the-art supervised methods.
Classes in training dataset: bed, book, chair, floor, furn., obj., sofa, table, wall. Novel classes: ceil., pict., tv, wind.
| Method | bed | book | chair | floor | furn. | obj. | sofa | table | wall | ceil. | pict. | tv | wind. | mean IoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| U-Net [27] | 50.32 | 22.42 | 36.55 | 55.62 | 36.85 | 27.27 | 48.44 | 33.78 | 55.14 | - | - | - | - | - |
| Nakajima et al. [21] | 62.82 | 27.27 | 42.56 | 68.43 | 44.62 | 24.63 | 45.04 | 42.30 | 26.82 | - | - | - | - | - |
| Ours + 3D Map [36] | 62.80 | 23.96 | 33.10 | 63.41 | 50.58 | 27.28 | 58.68 | 40.23 | 54.53 | 31.42 | 19.37 | 43.98 | 31.30 | 41.59 |
| Ours | 64.22 | 22.28 | 41.79 | 67.38 | 56.15 | 28.61 | 49.31 | 40.95 | 63.18 | 29.30 | 28.69 | 52.20 | 53.92 | 46.05 |
Classes in training dataset: bed, book, chair, floor, furn., obj., sofa, table, wall. Novel classes: ceil., pict., tv, wind.
| Method | bed | book | chair | floor | furn. | obj. | sofa | table | wall | ceil. | pict. | tv | wind. | mean IoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ours GEO-only | 51.95 | 21.47 | 35.99 | 64.75 | 50.28 | 28.36 | 48.98 | 39.14 | 55.80 | 29.76 | 25.38 | 44.88 | 52.43 | 42.24 |
| Ours CNN-only | 60.07 | 28.23 | 37.55 | 63.53 | 49.48 | 30.16 | 51.21 | 43.59 | 59.94 | 20.82 | 22.60 | 39.41 | 42.30 | 42.22 |
| Ours | 64.22 | 22.28 | 41.79 | 67.38 | 56.15 | 28.61 | 49.31 | 40.95 | 63.18 | 29.30 | 28.69 | 52.20 | 53.92 | 46.05 |
| Component | Processing time |
|---|---|
| Building 3D segmentation map * | 18.2 ms |
| Deep feature extraction ** | 35.9 ms |
| Geometric feature extraction | 8.2 ms |
| Entropy computation | 2.3 ms |
| Feature/Entropy update | 33.4 ms |
| 3D segment clustering | 13.4 ms |
| Total | 93.2 ms |
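As a quick arithmetic check on the timing table, the reported total of 93.2 ms per frame corresponds to the 10.7Hz quoted in the abstract. The snippet below also sums the listed rows: together they exceed the reported total, and the total matches the sum of the rows without the first one. The footnotes for the asterisked rows are not reproduced in this text, so reading this as the map-building step being counted separately is only an assumption.

```python
# Per-component times in milliseconds, copied from the timing table.
times_ms = {
    "Building 3D segmentation map": 18.2,
    "Deep feature extraction": 35.9,
    "Geometric feature extraction": 8.2,
    "Entropy computation": 2.3,
    "Feature/Entropy update": 33.4,
    "3D segment clustering": 13.4,
}
reported_total_ms = 93.2

# Frames per second implied by the reported per-frame total.
frame_rate_hz = 1000.0 / reported_total_ms

# Sum of every listed row, and the same sum without the map-building row.
sum_all_ms = sum(times_ms.values())
sum_without_map_ms = sum_all_ms - times_ms["Building 3D segmentation map"]

print(f"frame rate: {frame_rate_hz:.1f} Hz")
print(f"sum of all rows: {sum_all_ms:.1f} ms")
print(f"sum without map building: {sum_without_map_ms:.1f} ms")
```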
Yoshikatsu Nakajima (1, 2), Byeongkeun Kang (1), Hideo Saito (2), Kris Kitani (1)
(1) Carnegie Mellon University, {byeongkk,kkitani}@andrew.cmu.edu
(2) Keio University, {nakajima,saito}@hvrl.ics.keio.ac.jp
Abstract
This work addresses the task of open-world semantic segmentation using RGBD sensing to discover new semantic classes over time. Although there are many types of objects in the real world, current semantic segmentation methods make a closed-world assumption and are trained only to segment a limited number of object classes. Towards a more open-world approach, we propose a novel method that incrementally learns new classes for image segmentation. The proposed system first segments each RGBD frame using both color and geometric information, and then aggregates that information to build a single segmented dense 3D map of the environment. The segmented 3D map representation is a key component of our approach, as it is used to discover new object classes by identifying coherent regions in the 3D map that have no semantic label. The use of coherent regions in the 3D map as the primitive element, rather than traditional elements such as surfels or voxels, also significantly reduces the computational complexity and memory use of our method. It thus leads to semi-real-time performance at 10.7Hz when incrementally updating the dense 3D map at every frame. Through experiments on the NYUDv2 dataset, we demonstrate that the proposed method is able to correctly cluster objects of both known and unseen classes. We also show a quantitative comparison with state-of-the-art supervised methods, the processing time of each step, and the influence of each component.
1 Introduction
Building a semantically annotated 3D map (i.e., semantic mapping) has become a vital research topic in the computer vision and robotics communities since it provides 3D location information as well as object/scene category information. It is useful in a wide range of applications including robot navigation, mixed/virtual reality, and remote robot control. In most of these applications, it is important to achieve both high accuracy and efficiency. In robot navigation, robots need to recognize objects accurately and efficiently to navigate actively changing environments without accidents. In mixed reality systems, accuracy and efficiency are important to achieve more natural interactions without delay. When controlling surgical robots remotely, they are even more essential.
Consequently, much research has been conducted to develop accurate and efficient systems for semantic mapping [15, 10, 20, 21, 29, 38, 41, 16, 18]. Most recent semantic mapping systems consist of two principal components: building a 3D map from RGBD images and performing semantic segmentation on either the images or the built 3D maps [15, 10, 20, 21]. Since the introduction of RGBD sensors such as Microsoft Kinect [42], many approaches have been presented for building a 3D map from RGBD images [22, 12, 14, 17]. Regarding semantic segmentation, as semantic segmentation algorithms for images have been studied extensively, most semantic mapping systems have adopted these algorithms. Recently, since convolutional neural networks (CNNs) have further improved the performance of semantic segmentation [19, 31, 4], CNNs have been incorporated to enhance the accuracy of semantic mapping [20, 21].
While these advancements improved the accuracy and efficiency of the overall system, the methods are limited in the objects that the system can recognize. As previous semantic mapping systems recognize objects and scenes by training a pixel-level classifier (e.g., a random forest or CNNs) using a training dataset, the systems are only able to recognize categories in the training dataset. This is a major limitation for autonomous systems, considering that the real world consists of numerous objects and stuff. Hence, we propose a novel system that can properly cluster both known objects and unseen things to enable discovering new categories. The proposed method first generates object-level segments in 3D. It then performs clustering of the object-level segments to associate objects of the same class and to discover new object classes.
The contributions of this paper are as follows: (1) We present, to the best of our knowledge, the first semantic mapping system that can properly discover clusters of both known objects and unseen objects in a 3D map (see Figure 1); (2) To effectively handle deep features and geometric cues in clustering, we propose to estimate the reliability of the deep features from CNNs using the entropy of the probability distribution from CNNs. We then use the estimated confidence to weight the two types of features; (3) We propose to utilize segments instead of elements (i.e., surfels and voxels) in assigning/updating features and in clustering to reduce computational cost and space complexity. This enables the overall framework to run in semi-real-time; (4) We improve object proposals in a 3D map by utilizing both geometric and color information. This is especially important for regions with poor geometric characteristics (e.g., pictures on a wall); (5) We demonstrate the effectiveness and efficiency of the proposed system by training CNNs on a subset of the classes in a dataset and by discovering the remaining classes using the proposed method.
2 Related Work
Semantic Scene Reconstruction Koppula et al. presented one of the earliest works on semantic scene reconstruction using RGBD images [15]. Given multiple RGBD images, they first stitched the images to a single 3D point cloud. They then over-segmented the point cloud and labeled each segment using a graphical model.
As many 2D semantic segmentation approaches achieved impressive results [19, 31, 4], Hermans et al. proposed to use 2D semantic segmentation for 3D semantic reconstruction instead of segmenting 3D point clouds [10]. They first performed 2D semantic segmentation using randomized decision forests (RDF) and refined the result using a dense Conditional Random Field (CRF). They then transferred class labels to 3D maps. As convolutional neural networks (CNNs) have recently improved 2D semantic segmentation further, McCormac et al. presented a system that utilizes CNNs for 2D semantic segmentation instead of RDF [20]. While we focus on semantic scene reconstruction methods using RGBD images, there are also methods using stereo image pairs [29, 38, 41] and a monocular camera [16, 18].
While all the previous works [15, 10, 20, 29, 38, 41, 16, 18] can recognize only learned object classes, we propose, to the best of our knowledge, the first semantic scene reconstruction system that can segment unseen object classes as well as trained classes.
Image Segmentation Image segmentation has been studied extensively [26, 32, 3, 7, 5, 8, 11, 9, 1, 2]. Relatively recently, Pont-Tuset et al. proposed an approach for bottom-up hierarchical image segmentation [24]. They developed a fast normalized cuts algorithm and also proposed a hierarchical segmenter that uses multiscale information. They then employed a grouping strategy that combines multiscale regions into highly accurate object proposals. As convolutional neural networks (CNNs) have become a popular approach in semantic segmentation, Xia et al. proposed a CNN-based method for unsupervised image segmentation [39]. They segmented images by learning autoencoders with consideration of the normalized cut and smoothed the segmentation outputs using a conditional random field. They then performed hierarchical segmentation that first converts over-segmented partitions into weighted boundary maps and then merges the most similar regions iteratively.
Considering RGBD data, Yang et al. proposed a two-stage segmentation method that consists of over-segmentation using 3-D geometry enhanced superpixels and graph-based merging [40]. They first applied a K-means-like clustering method to the RGBD data for over-segmentation using an 8-D distance metric constructed from both color and 3-D geometrical information. They then employed a graph-based model to relabel the superpixels into segments considering RGBD proximity, texture similarity, boundary continuity, and the number of labels.
Compared to the previous works [26, 32, 3, 7, 5, 8, 11, 9, 1, 2, 24, 39, 40], this work differs in two respects. First, we propose a segmentation algorithm for 3D reconstructed scenes rather than images. Second, we aim to group pixels with the same semantic meaning into one cluster even if they are distant or separated by another segment.
3 Class Discovery for Semantic Segmentation
In order to discover new classes of semantic segments, we need a method for aggregating and clustering unknown segments (i.e., segments of the image which cannot be classified into a known class). A central component of our proposed approach is the segmentation of a dense 3D reconstruction of the scene, which we call the 3D segmentation map. It is used to aggregate information about each 2D image segment, and that information is then used to perform the 3D segment clustering that discovers new ‘semantic’ classes (nameless categories).
To incrementally discover object classes using RGBD sensing, we first propose to build a 3D segmentation map for object-level segmentation in 3D. Second, we perform clustering of the object-level segments to associate objects of the same class and to discover new object classes. Figure 2 shows an overview of the proposed framework. Given an input RGBD stream, we build a 3D segmentation map (Section 3.1) and perform incremental clustering (Section 3.2). The incremental clustering consists of extracting features for each frame (Section 3.2.1) and clustering using the features (Section 3.2.2). The output of the proposed method is a visualization of cluster membership on a reconstructed 3D map.
3.1 Building the 3D Segmentation Map
As mentioned above, the 3D segmentation map is the key data structure which is used to aggregate information about 2D image segmentation to discover new semantic classes. Building the 3D segmentation map is an incremental process, which consists of the following four processes applied to each frame: (1) SLAM for dense 3D map reconstruction; (2) SLIC for superpixel segmentation; (3) Agglomerative clustering; and (4) Updating the 3D segmentation map. We describe the details of each processing step below.
Dense SLAM. In order to estimate camera poses and incrementally build a 3D map, we employ the dense SLAM method, InfiniTAM v3 [25]. The method builds 3D maps using an efficient and scalable representation method which was proposed by Keller et al. [14]. The representation is a point-based description with normal information and is referred to as surfel. We denote surfels using .
The surfel is a fundamental element in our reconstructed 3D map (like pixels on an image). Given a new depth frame, we generate surfels and fuse them into the existing reconstructed 3D map. Hence, building the 3D segmentation map includes building a reconstructed 3D map using SLAM and grouping surfels in the reconstructed 3D map.
RGBD SLIC. For every RGBD frame, we first implement a modified SLIC superpixel segmentation algorithm to generate roughly 250 superpixels (small image regions) for each frame. To use both color information and geometric cues, we define a new distance metric that uses the color image in the CIELAB color space, the normal map , and the image coordinates . The distance between pixels and is computed as follows:
[TABLE]
where and are constants for weighting and . Given the set of superpixels from the SLIC segmentation, we compute the averaged color , vertex , and normal of each superpixel , which will be used to further merge superpixels into bigger 2D regions.
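The exact form of the RGBD SLIC distance is not preserved in this text, so the following is only a minimal sketch under stated assumptions: the distance is taken to be a weighted sum of Euclidean distances in CIELAB color, surface normal, and image coordinates. The function name `slic_distance`, the parameters `w_normal` and `w_xy`, and their default values are hypothetical and not taken from the paper.

```python
import math


def slic_distance(lab_a, lab_b, normal_a, normal_b, xy_a, xy_b,
                  w_normal=1.0, w_xy=0.5):
    """Combined pixel distance (assumed form, not the paper's exact formula)."""
    d_color = math.dist(lab_a, lab_b)         # CIELAB color difference
    d_normal = math.dist(normal_a, normal_b)  # difference of unit normals
    d_xy = math.dist(xy_a, xy_b)              # distance on the image plane
    return d_color + w_normal * d_normal + w_xy * d_xy


# Two pixels on the same flat surface, two pixels apart, with a small color shift.
d = slic_distance((50.0, 0.0, 0.0), (50.0, 3.0, 4.0),
                  (0.0, 0.0, 1.0), (0.0, 0.0, 1.0),
                  (0.0, 0.0), (0.0, 2.0))
print(d)
```

With a normal term in the metric, two pixels of similar color on differently oriented surfaces are pushed into different superpixels, which is the role the geometric cue plays in the description above.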
Agglomerative Clustering. Since the SLIC superpixel segmentation tends to generate a grid of segments with similar sizes, we perform agglomerative clustering and merging to produce object-level segments. The clustering and merging are based on the similarity in averaged color, vertex, and normal between superpixels. Specifically, we compute the similarity in color space, the geometric distance in 3D space, and the convexity in shape. We then merge the superpixels if all the measured similarities/distances meet the following conditions.
Consider two neighboring superpixels. The color similarity, geometric distance, and convexity are computed as follows:
[TABLE]
Given these three measures, the pair of superpixels is merged only when it satisfies the predetermined criteria:
[TABLE]
where , , and denote the corresponding thresholds for , , and , respectively. Regarding convexity criteria, it is based on the observation that objects on captured images usually have convex shapes [36]. Consequently, we penalize merging regions with concave shapes. is computed using the noise model in [23], which presented the relationship between noise and distance from a sensor.
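The merging rule can be sketched as follows. This is a minimal illustration, not the paper's formulas (which are not preserved in this text): the color measure is assumed to be the Euclidean distance between mean CIELAB colors, the geometric measure a point-to-plane distance, and the convexity measure the commonly used test that the difference of normals and the difference of centroids point the same way. All names and the fixed threshold values are hypothetical; in particular, the sketch uses a constant geometric threshold rather than one derived from a sensor noise model.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def should_merge(sp_a, sp_b, t_color=10.0, t_geo=0.02, t_convex=0.0):
    """All three criteria must hold (assumed measures, hypothetical thresholds)."""
    d_color = math.dist(sp_a["lab"], sp_b["lab"])
    # Distance of b's centroid from the plane through a's centroid.
    d_geo = abs(dot(sp_a["normal"], sub(sp_b["vertex"], sp_a["vertex"])))
    # Non-negative for flat or convex junctions, negative for concave ones.
    convexity = dot(sub(sp_a["normal"], sp_b["normal"]),
                    sub(sp_a["vertex"], sp_b["vertex"]))
    return d_color < t_color and d_geo < t_geo and convexity >= t_convex


def merge_superpixels(superpixels, neighbor_pairs):
    """Union-find merge over neighboring pairs; returns one label per superpixel."""
    parent = list(range(len(superpixels)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in neighbor_pairs:
        if should_merge(superpixels[a], superpixels[b]):
            parent[find(a)] = find(b)
    return [find(i) for i in range(len(superpixels))]


# Two similar patches of one wall and a differently colored picture on it.
superpixels = [
    {"lab": (60.0, 0.0, 0.0), "vertex": (0.0, 0.0, 1.0), "normal": (0.0, 0.0, -1.0)},
    {"lab": (61.0, 1.0, 0.0), "vertex": (0.1, 0.0, 1.0), "normal": (0.0, 0.0, -1.0)},
    {"lab": (20.0, 30.0, 10.0), "vertex": (0.2, 0.0, 1.0), "normal": (0.0, 0.0, -1.0)},
]
labels = merge_superpixels(superpixels, [(0, 1), (1, 2)])
print(labels)
```

The example also shows why color matters here: the third superpixel is coplanar with the wall, so only the color criterion keeps it separate. The sketch tests each pair once using the original superpixel averages; it does not recompute averages after a merge.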
3D Segmentation Map Update. Given the 2D segmentation result of the current frame, we update the 3D segmentation map. We employ the efficient and scalable segment propagation method in [36] to assign/update a segment label for each surfel.
3.2 Incremental Clustering
In the previous section, we generated object-level segments by clustering and merging superpixels. The object-level segments are then used to update the 3D segmentation map. Given the object-level segments in the 3D segmentation map, incremental clustering aims to discover new object classes by clustering the object-level segments. To cluster the segments, we first extract features using an input RGBD frame and the 3D segmentation map. We then cluster by computing a weighted similarity between the segments. We describe the details of online feature extraction in Section 3.2.1 and those of 3D segment clustering in Section 3.2.2 (see also Figure 4).
3.2.1 Online Feature Extraction
In order to accurately associate objects of the same class or to discover new object classes, we need a method for estimating the similarity between object segments in the 3D segmentation map. While measuring similarity can be as simple as computing a distance in color space, a more meaningful measurement is required to accurately determine object classes. Moreover, as objects often appear in multiple consecutive frames of a video, we can improve the similarity measurement by utilizing previous frames. Lastly, as storing all the information from previous frames is expensive, we need an efficient method to keep the past information.
To estimate a more meaningful similarity, we utilize both features from color images and geometric features, as they are often complementary. In particular, as convolutional neural networks have achieved impressive results in per-pixel classification tasks [19, 31, 4], we extract features from color images using CNNs. The extracted deep features and geometric features for each frame are then used to update the features for each segment in the 3D segmentation map. By aggregating the features from all the previous frames, we improve the robustness of the features in the 3D segmentation map. Moreover, storing/updating the features for each segment is a very effective strategy for both saving memory and reducing computation in 3D segment clustering. Considering that the number of segments is much smaller than the number of surfels in the 3D map, the reduction in memory usage is very significant. Specifically, the memory usage scales with the number of object-level segments in the 3D segmentation map rather than with the number of surfels, in both cases multiplied by the combined dimension of the deep features and the geometric features.
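To make the scale of this saving concrete, the following back-of-the-envelope sketch counts stored feature values for per-surfel versus per-segment storage. Every number in it (surfel count, segment count, feature dimensions, 4-byte floats) is a hypothetical value chosen for illustration, not a figure from the paper.

```python
def feature_values(num_elements, deep_dim, geo_dim):
    """Number of stored values when each element keeps both feature vectors."""
    return num_elements * (deep_dim + geo_dim)


# Hypothetical sizes, for illustration only.
num_surfels = 1_000_000
num_segments = 300
deep_dim, geo_dim = 64, 75
bytes_per_value = 4  # 32-bit floats

per_surfel = feature_values(num_surfels, deep_dim, geo_dim)
per_segment = feature_values(num_segments, deep_dim, geo_dim)

print(f"per-surfel storage:  {per_surfel * bytes_per_value / 1e6:.1f} MB")
print(f"per-segment storage: {per_segment * bytes_per_value / 1e6:.3f} MB")
print(f"reduction factor:    {per_surfel / per_segment:.0f}x")
```

The reduction factor is simply the ratio of surfels to segments, independent of the feature dimensions.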
While CNNs have shown impressive results, the reliability of deep features can vary depending on the region of the input image. We hypothesize that regions for which CNNs can predict a class with high confidence can be clustered accurately using deep features. Hence, we estimate the reliability of deep features using the predicted probability distribution from CNNs. Specifically, we compute the confidence by calculating the entropy of the predicted probability distribution. Based on the estimated reliability, we then compute a weighted affinity using the similarity of the geometric features and that of the deep features between object-level segments in the 3D segmentation map.
For deep features and entropy, we employ the U-Net architecture [27] since our target applications (e.g. robot navigation) often demand short processing time. The network takes only to process an input image of resolution. Also, by using the same network for both processing, we can save computations.
Geometric Feature Extraction/Update. To extract translation/rotation-invariant and noise-robust geometric features, we first estimate a Local Reference Frame (LRF) for each segment. We then extract geometric features for each segment using a fast and unique geometric feature descriptor, Global Orthographic Object Descriptor (GOOD) [13].
Given a depth map, estimating the LRF for each segment requires the 3D segmentation map on the current image plane. Hence, we first render the segmentation map to the current image plane and obtain the rendered segmentation map with segment labels. We then compute the LRF by applying Principal Component Analysis (PCA) to each segment. In more detail, we first compute the normalized covariance matrix and then perform eigenvalue decomposition. The normalized covariance matrix of each segment is computed using the vertex map and the rendered segmentation map as follows:
[TABLE]
where represents the geometric center of the segment ; denotes the set of vertices that belong to the segment on the current frame; represents the number of elements in the set. We then perform eigenvalue decomposition on as follows:
[TABLE]
where is a matrix with three eigenvectors; is a diagonal matrix with the corresponding eigenvalues. is directly utilized as the LRF.
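The LRF computation described above (covariance about the geometric center, then eigenvalue decomposition) can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the covariance is assumed to be normalized by the number of vertices, and a small Jacobi rotation routine stands in for a library eigen-solver such as `numpy.linalg.eigh` so that the example has no dependencies.

```python
import math


def covariance(vertices):
    """Return the centroid and the 3x3 covariance normalized by the vertex count."""
    n = len(vertices)
    center = [sum(v[k] for v in vertices) / n for k in range(3)]
    cov = [[0.0] * 3 for _ in range(3)]
    for v in vertices:
        d = [v[k] - center[k] for k in range(3)]
        for i in range(3):
            for j in range(3):
                cov[i][j] += d[i] * d[j] / n
    return center, cov


def jacobi_eigen(matrix, max_rotations=50, tol=1e-12):
    """Eigen-decomposition of a symmetric 3x3 matrix by Jacobi rotations."""
    a = [row[:] for row in matrix]
    v = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    pairs = [(0, 1), (0, 2), (1, 2)]
    for _ in range(max_rotations):
        # Pick the largest off-diagonal entry and rotate it to zero.
        p, q = max(pairs, key=lambda pq: abs(a[pq[0]][pq[1]]))
        if abs(a[p][q]) < tol:
            break
        theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
        c, s = math.cos(theta), math.sin(theta)
        for k in range(3):  # a <- a @ J
            akp, akq = a[k][p], a[k][q]
            a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
        for k in range(3):  # a <- J^T @ a
            apk, aqk = a[p][k], a[q][k]
            a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
        for k in range(3):  # accumulate eigenvectors: v <- v @ J
            vkp, vkq = v[k][p], v[k][q]
            v[k][p], v[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    return [a[i][i] for i in range(3)], v


def local_reference_frame(vertices):
    """Centroid, eigenvalues (descending), and the matching eigenvectors as axes."""
    center, cov = covariance(vertices)
    eigenvalues, vecs = jacobi_eigen(cov)
    order = sorted(range(3), key=lambda k: eigenvalues[k], reverse=True)
    axes = [tuple(vecs[r][k] for r in range(3)) for k in order]
    return center, [eigenvalues[k] for k in order], axes


# A flat, elongated patch in the z = 0 plane, rotated 45 degrees about z.
vertices = [(u - w, u + w, 0.0)
            for u in (-2.0, -1.0, 0.0, 1.0, 2.0)
            for w in (-0.5, 0.0, 0.5)]
center, eigenvalues, axes = local_reference_frame(vertices)
print(eigenvalues)
print(axes)
```

For this patch the axis with the smallest eigenvalue is the plane normal and the axis with the largest eigenvalue follows the long side. Eigenvector signs are ambiguous, so a practical descriptor pipeline needs a sign convention on top of this.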
Lastly, we employ a fast and unique geometric feature descriptor, GOOD [13]. For each segment, we transform the set of vertices using the LRF. We then feed the transformed vertices into the descriptor to obtain the frame-wise geometric feature.
After computing using the current depth map, the geometric features in the 3D segmentation map are updated as follows:
[TABLE]
This updates are applied to all segments on the rendered segmentation map . denotes the constant for normalizing the feature vector .
Deep Feature Extraction/Update. We utilize the output of the layer just before the last classification layer for deep feature map. The per-frame deep feature map is denoted as . The size of is where and represent the width and height of an input image, respectively; and denotes the number of channels (i.e. the dimension of the features) which is .
We update the deep features for each segment in the 3D segmentation map by employing an incremental averaging approach and by using the per-frame deep features. Since deep features and entropy are extracted for each pixel while geometric features are obtained for each segment, the update procedures are slightly different. The deep features are updated as follows:
[TABLE]
where is the normalizing constant for ; is all the coordinates on .
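A per-segment incremental average of per-pixel features can be sketched as below. The update equation itself is not preserved in this text, so this is an assumption-laden illustration: each segment keeps a running mean and a weight equal to the number of pixels seen so far, and every new frame folds in the pixels that the rendered segmentation map assigns to that segment. The function name, the dictionary layout, and the use of `None` for unlabeled pixels are hypothetical.

```python
def update_segment_features(store, label_map, feature_map):
    """Fold one frame of per-pixel features into per-segment running means.

    store maps a segment label to (mean_feature, weight), where weight is the
    number of pixels averaged so far (assumed normalization).
    """
    sums, counts = {}, {}
    for labels_row, feats_row in zip(label_map, feature_map):
        for label, feat in zip(labels_row, feats_row):
            if label is None:  # pixel not covered by any segment
                continue
            acc = sums.setdefault(label, [0.0] * len(feat))
            for k, value in enumerate(feat):
                acc[k] += value
            counts[label] = counts.get(label, 0) + 1
    for label, acc in sums.items():
        old_mean, old_weight = store.get(label, ([0.0] * len(acc), 0))
        new_weight = old_weight + counts[label]
        new_mean = [(old_weight * m + s) / new_weight
                    for m, s in zip(old_mean, acc)]
        store[label] = (new_mean, new_weight)
    return store


store = {}
# Frame 1: a 2x2 image with 2-dimensional features; one pixel is unlabeled.
update_segment_features(
    store,
    [[1, 1], [2, None]],
    [[[1.0, 0.0], [3.0, 0.0]], [[0.0, 2.0], [9.0, 9.0]]],
)
# Frame 2: a 1x2 image that sees both segments again.
update_segment_features(store, [[1, 2]], [[[5.0, 3.0], [0.0, 4.0]]])
print(store)
```

Only the running mean and one weight per segment are kept, so no per-frame or per-surfel history has to be stored.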
Entropy Computation/Update. The entropy is computed by first estimating the probability distribution for each class and by measuring the Shannon entropy [30] using the probability distribution. As the network is trained for semantic segmentation, the probability distribution is obtained by the output of the softmax layer of the network. The entropy is computed at each pixel as follows:
[TABLE]
where is the probability for the class at the pixel . Then, is used for updating the entropy for each segment in the 3D segmentation map as follows:
[TABLE]
where is all the coordinates on .
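The per-pixel entropy step can be illustrated with a short sketch. It assumes the natural logarithm (the base is not stated in this text) and uses nine classes in the example, matching the number of training classes listed in the results table; the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(probs):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


num_classes = 9
uncertain = shannon_entropy(softmax([0.0] * num_classes))         # uniform prediction
confident = shannon_entropy([1.0] + [0.0] * (num_classes - 1))    # one-hot prediction

print(uncertain, math.log(num_classes))
print(confident)
```

A uniform prediction reaches the maximum entropy, the logarithm of the number of classes, while a one-hot prediction has zero entropy. This bound is what allows the entropy to be rescaled into a weight between 0 and 1 in the clustering stage.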
3.2.2 3D Segment Clustering
Given semantic and geometric features in the 3D segmentation map from the feature updating stage, we apply a graph-based unsupervised clustering algorithm to cluster regions in the 3D segmentation map. We specifically employ the Markov clustering algorithm (MCL) [37] because of its flexible number of clusters and its computational cost. Since we aim to handle unknown objects in a scene, the number of clusters (class categories) must be flexible, as it is in the MCL. Furthermore, the computational cost of the MCL comes from the multiplication of two square matrices whose size is the number of nodes in the graph, so the cost can be reduced by parallelizing the processing on a GPU. This reduces the processing time and makes the method more appropriate for an online system.
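For readers unfamiliar with the MCL, the following is a minimal CPU sketch of its standard loop: add self-loops, make the matrix column-stochastic, then alternate expansion (squaring the matrix) and inflation (element-wise power followed by renormalization) until the matrix stops changing. It is not the authors' GPU implementation; the inflation value, tolerances, and the cluster read-out by connected components are assumptions made for this example.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_clustering(similarity, inflation=2.0, max_iters=50, tol=1e-9):
    """Return one cluster label per node for a symmetric similarity matrix."""
    n = len(similarity)
    # Add self-loops on the diagonal, then make the matrix column-stochastic.
    m = normalize_columns(
        [[1.0 if i == j else similarity[i][j] for j in range(n)]
         for i in range(n)])
    for _ in range(max_iters):
        expanded = matmul(m, m)  # expansion: flow spreads along the graph
        inflated = normalize_columns(  # inflation: strong flow is reinforced
            [[x ** inflation for x in row] for row in expanded])
        change = max(abs(inflated[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = inflated
        if change < tol:
            break
    # Nodes that still exchange flow after convergence share a cluster.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(n):
            if m[i][j] > 1e-6:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


# Five segments: the first three are mutually similar, as are the last two.
similarity = [
    [0.0, 0.9, 0.9, 0.01, 0.01],
    [0.9, 0.0, 0.9, 0.01, 0.01],
    [0.9, 0.9, 0.0, 0.01, 0.01],
    [0.01, 0.01, 0.01, 0.0, 0.9],
    [0.01, 0.01, 0.01, 0.9, 0.0],
]
labels = markov_clustering(similarity)
print(labels)
```

The number of clusters is never passed in; it emerges from the similarity structure, which is the property the text relies on for handling unknown classes. The expansion step is the matrix product that dominates the cost.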
We define the similarity between nodes (i.e. regions and in the 3D segmentation map). The weight values and are first computed using the entropy and the number of classes in the training dataset for the U-Net as follows:
[TABLE]
The denominator is selected to make to be in [0,1] considering the maximum value of is . The similarity is then defined using and as follows:
[TABLE]
where is a predefined constant. Based on the assumption that the entropies of regions belonging to unknown object categories are high, the similarity measurement between these regions relies more on geometric features than on deep features. We calculate the similarity for each pair of regions and feed the similarities to the MCL to update the clusters.
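The idea of entropy-weighted similarity can be sketched as follows. The weight and similarity equations are not preserved in this text, so everything concrete here is an assumption: the geometric weight is taken as the mean entropy of the two segments divided by the logarithm of the number of training classes, the deep-feature weight is its complement, and the similarity is an exponential of the weighted feature distance with a hypothetical constant `scale`.

```python
import math


def entropy_weights(entropy_a, entropy_b, num_classes):
    """Return (w_deep, w_geo), both in [0, 1] (assumed form)."""
    w_geo = 0.5 * (entropy_a + entropy_b) / math.log(num_classes)
    w_geo = min(max(w_geo, 0.0), 1.0)
    return 1.0 - w_geo, w_geo


def segment_similarity(seg_a, seg_b, num_classes, scale=1.0):
    """Similarity in (0, 1]; high entropy shifts the weight to geometry."""
    w_deep, w_geo = entropy_weights(seg_a["entropy"], seg_b["entropy"], num_classes)
    d_deep = math.dist(seg_a["deep"], seg_b["deep"])
    d_geo = math.dist(seg_a["geo"], seg_b["geo"])
    return math.exp(-(w_deep * d_deep + w_geo * d_geo) / scale)


num_classes = 9
max_entropy = math.log(num_classes)

# Confident segments: identical deep features, different geometry.
known_a = {"entropy": 0.0, "deep": (1.0, 0.0), "geo": (0.0, 1.0)}
known_b = {"entropy": 0.0, "deep": (1.0, 0.0), "geo": (1.0, 0.0)}
# Uncertain segments: identical geometry, different deep features.
novel_a = {"entropy": max_entropy, "deep": (1.0, 0.0), "geo": (0.5, 0.5)}
novel_b = {"entropy": max_entropy, "deep": (0.0, 1.0), "geo": (0.5, 0.5)}

sim_known = segment_similarity(known_a, known_b, num_classes)
sim_novel = segment_similarity(novel_a, novel_b, num_classes)
print(sim_known, sim_novel)
```

In both example pairs the similarity is 1: for confident segments the differing geometry is ignored, and for maximally uncertain segments the differing deep features are ignored. Intermediate entropies blend the two distances.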
4 Experiments and Results
References
[1] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. From contours to regions: An empirical evaluation. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 2294–2301, June 2009.
[2] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, May 2011.
[3] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222–1239, Nov 2001.
[4] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, April 2018.
[5] D. Comaniciu and P. Meer. Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, May 2002.
[6] C. Couprie, C. Farabet, L. Najman, and Y. Le Cun. Indoor semantic segmentation using depth information. In International Conference on Learning Representations, 2013.
[7] Y. Deng and B. S. Manjunath. Unsupervised segmentation of color-texture regions in images and video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(8):800–810, Aug 2001.
[8] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167–181, Sep 2004.
