Urban Scene Segmentation with Laser-Constrained CRFs
Charika De Alvis, Lionel Ott, Fabio Ramos

TL;DR
This paper introduces a novel CRF inference method for urban scene segmentation that integrates global constraints from multiple sensor modalities, improving segmentation accuracy in complex environments.
Contribution
A new CRF inference approach formulated as a relaxed quadratic program that efficiently incorporates global constraints for multi-modal scene segmentation.
Findings
Outperforms belief propagation and traditional CRF methods
Effectively combines image and 3D point cloud data
Enhances scene segmentation accuracy in urban environments

| Type | Description | Dimensionality |
|---|---|---|
| Texture | RGB gradient magnitude histogram | |
| | RGB gradient orientation histogram | |
| Colour | RGB mean | 3 |
| | RGB std | 3 |
| | HSV histogram | |
| Location | Super pixel image coordinates | |

| Method | Average Precision | Average Recall | Average Accuracy | F1 Score |
|---|---|---|---|---|
| Discriminant Analysis Classifier | | | | |
| Loopy Belief Propagation | | | | |
| Quadratic Programming Relaxation | | | | |
| Higher Order Potentials [12] | | | | |
| Constrained Quadratic Programming | | | | |

| Class | Average Precision (HOP) | Average Precision (CQP) | Average Recall (HOP) | Average Recall (CQP) | Average Accuracy (HOP) | Average Accuracy (CQP) | F1 Score (HOP) | F1 Score (CQP) |
|---|---|---|---|---|---|---|---|---|
| Cyclists & Pedestrians | | | | | | | | |
| Roads & Paved Area | | | | | | | | |
| Vegetation | | | | | | | | |
| Buildings | | | | | | | | |
| Sky | | | | | | | | |
| Vehicles | | | | | | | | |
Charika De Alvis, Lionel Ott and Fabio Ramos are with the School of Information Technologies, The University of Sydney, Australia.
Abstract
Robots typically possess sensors of different modalities, such as colour cameras, inertial measurement units, and 3D laser scanners. Often, solving a particular problem becomes easier when more than one modality is used. However, while there are undeniable benefits to combining sensors of different modalities, the process tends to be complicated. Segmenting scenes observed by the robot into a discrete set of classes is a central requirement for autonomy, as understanding the scene is the first step towards reasoning about future situations. Scene segmentation is commonly performed using either image data or 3D point cloud data. In computer vision many successful methods for scene segmentation are based on conditional random fields (CRF) where the maximum a posteriori (MAP) solution to the segmentation can be obtained by inference. In this paper we devise a new CRF inference method for scene segmentation that incorporates global constraints, enforcing that sets of nodes are assigned the same class label. To do this efficiently, the CRF is formulated as a relaxed quadratic program whose MAP solution is found using a gradient-based optimisation approach. The proposed method is evaluated on images and 3D point cloud data gathered in urban environments where image data provides the appearance features needed by the CRF, while the 3D point cloud data provides global spatial constraints over sets of nodes. Comparisons with belief propagation, conventional quadratic programming relaxation, and higher order potential CRF show the benefits of the proposed method.
I Introduction
Scene segmentation is a core competency for many robotic tasks. It provides the foundation which allows a robot to understand and reason about its environment. For navigation in urban environments such information is critical for safety, as it allows the robot to predict which areas pose a risk due to the presence of dynamic objects. Robots typically carry many different sensors, such as cameras, laser scanners, and RGB-D cameras, which observe the environment from slightly different angles. This variation in viewpoint and modality makes the optimal combination of sensors very challenging. In this paper we propose a model which effectively combines multiple modalities. The method is applied to image segmentation using camera and laser scan data but is general in nature and applicable to a wide variety of sensor combinations.
Our method is based on a relaxed quadratic program formulation of CRFs for scene segmentation which enforces a set of global constraints. Image data is used to build the CRF graph and potential functions while the depth data is used to formulate global constraints over sets of nodes in the CRF. These constraint sets contain all nodes belonging to the same object, as determined by the depth data, and ensure they take the same label during the inference process. The method finds the MAP solution using an efficient gradient-based algorithm, based on [27]. The main contributions of the paper are:
- Novel CRF formulation using global constraints capable of enforcing label consistency;
- Experimental evaluation of the proposed method for scene segmentation using image and 3D laser data gathered by a robotic platform.
The remainder of the paper is structured as follows. In Section II we give an overview of work related to ours, before we introduce our method in Section III. In Section IV we provide an experimental evaluation of our method before concluding in Section V.
II Related Work
In computer vision many successful image segmentation methods are based on graph cuts [5] and refinements such as normalised cuts [22, 4]. Graph cuts represent the image as a graph and attempt to find the set of edges with minimal cost that, when cut, results in a segmentation of the image. There are other approaches that work on a similar representation but use a different way of solving the problem. Felzenszwalb and Huttenlocher [7] propose a method that uses greedy local segmentation decisions to obtain accurate global results. A novel graphical model, associative hierarchical random fields, with applications to scene segmentation was proposed in [16]. Stereo vision based scene segmentation is another common method. For example, He and Upcroft [9] present a method to build a dense 3D semantic occupancy map of an environment based on semantic labels obtained using a Markov random field which are used to update the semantic labels of the map cells. A similar approach is taken in [21] using a CRF for the segmentation task and creating a triangulated mesh of the environment rather than a voxel grid. All of these approaches use only image data without any additional outside information.
In robotics there has been a lot of work on scene segmentation using multiple modalities, such as camera and 3D laser data. Douillard et al. [6] propose a spatial-temporal CRF method integrating measurements from a conventional laser scanner with images from a calibrated camera. Munoz et al. [18] extract features from image and laser data and use these in a classifier to segment the scene. A method that accumulates image based segmentation results in a point cloud was presented by Hermans et al. [10]. An extension of [7] to RGB-D data is presented in [23], taking advantage of distance and normal information. In [3] a link-chain clustering method operating on a super voxel representation of RGB-D data was presented. Xu et al. [25] present a method using multiple independent classifiers with a sophisticated fusion framework. A method that exploits both colour and depth information with the help of a CRF is presented in [26]. This method makes predictions separately on the depth and colour data and fuses the results using a CRF.
Higher order potentials (HOP) allow encoding additional information which the unary and pairwise potentials of a CRF cannot represent. This enables modelling longer range dependencies within the model. These HOP act as soft constraints during the optimisation. Kohli et al. [13] use a Potts model-based CRF with HOP for the task of image segmentation and use a graph cut based algorithm to solve the optimisation problem. Tarlow et al. [24] proposed a method with HOP models and belief propagation, adopting a set of potentials for which efficient message passing rules exist. In [14] a dual decomposition based master-slave framework is presented to solve generic higher order Markov random fields. While HOP can be created from the same information as the constraints, they only form soft constraints and as such can be violated in the final solution. Our approach, in contrast, ensures that the imposed constraints from depth information are satisfied by the solution. Furthermore, we exploit much simpler features than state of the art methods while providing higher accuracy even for difficult classes such as pedestrians. Our method also has the potential to be implemented in real time.
III Globally Constrained CRF
Our segmentation method is based on conditional random fields (CRF) with unary and pairwise potentials. The additional a priori information about sets of points which belong to the same group is encoded as constraints on the CRF. A graphical representation of this structure is shown in the accompanying figure (graph-example), where nodes are denoted by circles while edges indicate connections between nodes. The two sets of nodes coloured identically represent sets of nodes constrained to take the same label. The unary and pairwise potentials are based on information extracted from the image while the information about groups of nodes is extracted from 3D laser data. Our goal is to find the best label assignment for each node, i.e. the MAP solution of the CRF. To do this efficiently we represent the CRF as a quadratic program.
III-A Conditional Random Field
The log likelihood model of a conditional random field is given by:
[TABLE]
where is the normaliser, is the set of discrete random variables associated with the super pixels [1] set in the input image. Each super pixel is assigned one of the output labels . The potential functions of the CRF are denoted by for the unary potential and for the pairwise potential defined for each super pixel and each of its neighbours .
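For orientation, a pairwise CRF of this kind is commonly written in the form below. This is a generic textbook sketch rather than the paper's own equation: the symbols $\mathbf{x}$ (labels), $\mathbf{y}$ (observed image), $\psi_i$, $\psi_{ij}$, $\mathcal{V}$, $\mathcal{N}(i)$ and $Z$ are illustrative notation chosen here, and the sign convention assumes the potentials are log-scores.

```latex
\[
\log p(\mathbf{x} \mid \mathbf{y})
  = \sum_{i \in \mathcal{V}} \psi_i(x_i, \mathbf{y})
  + \sum_{i \in \mathcal{V}} \sum_{j \in \mathcal{N}(i)} \psi_{ij}(x_i, x_j, \mathbf{y})
  - \log Z(\mathbf{y})
\]
```

Here $\mathcal{V}$ would be the set of super pixels, $\mathcal{N}(i)$ the neighbours of super pixel $i$, and $Z(\mathbf{y})$ the normaliser.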
III-B Quadratic Program Formulation
The goal is to find the best assignment of labels to the nodes (MAP assignment) considering local and global information. As finding the MAP solution to Eq. (loglik) is NP-hard, we start by representing it as a quadratic integer program of the following form:
[TABLE]
with the indicator function:
[TABLE]
where encodes if node has been assigned label . This quadratic program formulation penalises disagreements between the data via the indicator function, which guides the model to obtain coherent segmentations. Additionally, Eqs. (qp-2) and (qp-3) enforce that exactly one label is selected for each node. Relaxing the integer requirement of the quadratic program [27] we obtain:
[TABLE]
Optimising Eq. (qp-relax) yields an approximation to the MAP solution for the segmentation problem. However, it does not yet include the global constraints on sets of nodes. Adding these constraints we obtain:
[TABLE]
where Eq. (qp-g-2) enforces that all pairs of points and in a constraint set are assigned the same label.
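As a rough guide to the shape of such a program, a relaxed quadratic program with label-equality constraints could be written as follows. This is a sketch under assumed notation, not a reproduction of the paper's equations: $\mu_i(l)$ is the relaxed indicator that node $i$ takes label $l$, $\theta_i$ and $\theta_{ij}$ stand for unary and pairwise potentials, $\mathcal{E}$ is the edge set, and $\mathcal{C}_m$ is the $m$-th constraint set.

```latex
\[
\begin{aligned}
\max_{\mu}\quad
  & \sum_{i} \sum_{l} \mu_i(l)\,\theta_i(l)
    + \sum_{(i,j) \in \mathcal{E}} \sum_{l} \sum_{k}
      \mu_i(l)\,\mu_j(k)\,\theta_{ij}\,\mathbb{1}[l = k] \\
\text{s.t.}\quad
  & \sum_{l} \mu_i(l) = 1 && \forall i, \\
  & 0 \le \mu_i(l) \le 1 && \forall i,\, l, \\
  & \mu_i(l) = \mu_j(l) && \forall l,\ \forall i, j \in \mathcal{C}_m,\ \forall m .
\end{aligned}
\]
```

Replacing the box constraint by $\mu_i(l) \in \{0, 1\}$ gives back an integer program, and dropping the last line gives the unconstrained relaxation.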
In order to solve Eq. (qp-relax-global) efficiently we follow [15] and rewrite it in matrix notation:
[TABLE]
where encodes the quadratic coefficients (pairwise potentials) and the linear coefficients (unary potentials). is the indicator matrix representing the variables and encodes the global constraints from Eq. (qp-g-2). The solution to Eq. (matrix-representation) can be found by introducing Lagrange multipliers as follows:
[TABLE]
We can achieve the same maximum as in Eq. (matrix-representation) by making equal to zero. To this end we introduce new variables:
[TABLE]
where has the dimension , while solving Eq. (nulsp) implies that is the null space of . Substituting these two equations back into Eq. (lagran) we obtain:
[TABLE]
This transformation has two benefits: First, the dimensionality of is reduced compared to that of based on the number of constraints. This means that a large number of constraints makes the optimisation problem easier to solve. Second, the optimisation problem is now unconstrained which again makes it easier to solve.
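One way to see why the substitution shrinks the problem: when every constraint has the form "these nodes share a label", a basis of the null space of the constraint matrix can be built from group-indicator columns, so the change of variables amounts to merging each constraint set into a single node. The Python sketch below illustrates this under that assumption. The function name `merge_constrained_nodes`, the data layout, and the requirement that constraint sets are disjoint are all hypothetical choices made for illustration, not the authors' implementation.

```python
from collections import defaultdict


def merge_constrained_nodes(num_nodes, constraint_sets, unary, pairwise):
    """Collapse each (disjoint) constraint set into one group variable.

    unary:    per-node score lists, unary[i][l]
    pairwise: dict {(i, j): weight} over undirected edges
    Returns (group_of, group_unary, group_pairwise).
    """
    # By default every node forms its own group.
    group_of = list(range(num_nodes))
    for nodes in constraint_sets:
        representative = min(nodes)
        for i in nodes:
            group_of[i] = representative

    # Renumber groups densely as 0..G-1.
    ids = {g: k for k, g in enumerate(sorted(set(group_of)))}
    group_of = [ids[g] for g in group_of]

    # Unary scores of merged nodes add up (this is Z^T c in matrix form).
    num_labels = len(unary[0])
    group_unary = [[0.0] * num_labels for _ in ids]
    for i, scores in enumerate(unary):
        for l, score in enumerate(scores):
            group_unary[group_of[i]][l] += score

    # Edge weights between merged nodes add up (this is Z^T Q Z).
    # A key (g, g) collects edges that lie inside one group.
    group_pairwise = defaultdict(float)
    for (i, j), weight in pairwise.items():
        a, b = sorted((group_of[i], group_of[j]))
        group_pairwise[(a, b)] += weight
    return group_of, group_unary, dict(group_pairwise)
```

With five nodes, two labels, and one constraint set covering three of the nodes, the merged problem has three group variables per label instead of five, which is the dimensionality reduction described above.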
Similar to the transformation from Eq. (qp-relax-global) to Eq. (matrix-representation), we can rewrite Eq. (uncons) using element-wise notation as follows:
[TABLE]
with the unary potential and the pairwise potential and denotes if label has been assigned to node . We optimise Eq. (lagrange-dual-optimisation) using gradient ascent, which can be done efficiently as the gradient can be computed in closed form [27]:
[TABLE]
with standing for Eq. (qp-lagrange-1).
This allows us to implement a highly efficient gradient ascent based algorithm as the gradient can be evaluated directly in closed form. Once the algorithm has converged we can extract the values of the original indicator variables and thus the MAP label assignments to the variables. To this end we transform the solution for obtained from Eq. (lagrange-dual-optimisation) back into the form of Eq. (qp-relax-global) using . is a column vector whose entries correspond to the values of the . The optimal assignment to each node is found by selecting the label for which holds. This is summarised in the accompanying algorithm (gcrf-algo). The required inputs are the values of the potentials over the possible value settings. Then the gradient is computed and used to update the solution iteratively until convergence is achieved. Finally, the solution is extracted and returned.
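The loop described here (compute a closed-form gradient, update, repeat, then read off labels) can be sketched in Python as follows. This is not the update rule of [27]: it swaps in plain projected gradient ascent on a relaxed objective with a Potts-style agreement reward, and it assumes edges connect distinct nodes. The function names, step size, and iteration count are hypothetical, and because the objective is not concave the result is only a local optimum.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, value in enumerate(u, start=1):
        cumulative += value
        t = (cumulative - 1.0) / k
        if value - t > 0:
            theta = t  # keep the threshold of the last index that qualifies
    return [max(x - theta, 0.0) for x in v]


def map_by_gradient_ascent(unary, edges, step=0.1, iterations=200):
    """Maximise sum_i u_i.x_i + sum_(i,j) w_ij (x_i . x_j) over per-node simplices.

    unary: per-node score lists, unary[i][l]
    edges: dict {(i, j): w_ij} with i != j
    Returns one label index per node.
    """
    n, num_labels = len(unary), len(unary[0])
    x = [[1.0 / num_labels] * num_labels for _ in range(n)]
    neighbours = [[] for _ in range(n)]
    for (i, j), w in edges.items():
        neighbours[i].append((j, w))
        neighbours[j].append((i, w))

    for _ in range(iterations):
        # Closed-form gradient: unary score plus weighted neighbour beliefs.
        grad = [[unary[i][l] + sum(w * x[j][l] for j, w in neighbours[i])
                 for l in range(num_labels)] for i in range(n)]
        x = [project_to_simplex([x[i][l] + step * grad[i][l]
                                 for l in range(num_labels)])
             for i in range(n)]

    # Read off the most likely label for every node.
    return [max(range(num_labels), key=lambda l: x[i][l]) for i in range(n)]
```

On a three-node chain where the middle node's unary score weakly prefers a different label from its two confident neighbours, the pairwise term pulls the middle node to the neighbours' label, whereas with no edges each node simply follows its own unary score.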
IV Experiments
In this section we present experimental evaluation of our proposed framework on the task of image-based scene segmentation. We use the KITTI dataset [8] as it provides typical urban data. The dataset was captured by driving around the city of Karlsruhe. Importantly, the data contains both colour images and Velodyne depth data. The image information is used to build the CRF model structure and potential functions while the Velodyne data is used to construct global constraint sets.
IV-A Model Building
We start by extracting super pixels from the image using SLIC [2], which forms an over-segmentation of the original image. From each image we extract roughly super pixels, shown in the accompanying figure (example-sp). Each of these represents a node in our CRF and the goal is to label them with one of seven classes: vehicle, pedestrian & cyclist, buildings, road & paved area, sky, vegetation, and unknown. Due to the low sample size of pedestrians and cyclists in the dataset they are assigned to the same class. The edges between nodes are defined by their distance within the image, i.e.:
[TABLE]
where is the distance threshold and is the Euclidean distance in image coordinates between centres of two super pixels. All super pixels closer than the user defined distance are connected. In our experiments was set such that each node is connected to roughly ten neighbouring nodes, which results in a roughly grid like structure.
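The thresholded edge rule can be illustrated with a few lines of Python. This is a minimal sketch: `build_edges` and its arguments are hypothetical names, the centres are assumed to be given as image coordinates, and "closer than" is read as a strict inequality.

```python
import math


def build_edges(centres, threshold):
    """Connect every pair of super pixels whose centres are closer than threshold.

    centres: list of (x, y) image coordinates, one per super pixel
    Returns a list of undirected edges (i, j) with i < j.
    """
    edges = []
    for i in range(len(centres)):
        for j in range(i + 1, len(centres)):
            if math.dist(centres[i], centres[j]) < threshold:
                edges.append((i, j))
    return edges
```

The double loop is quadratic in the number of super pixels; for larger graphs a spatial index would be the usual replacement, but the connectivity it produces is the same.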
The unary potentials are obtained from the posterior of a pseudo linear discriminant analysis classifier [17]. The classifier is trained on manually labelled images from the KITTI dataset using colour, texture, and location features, listed in the feature table (features). The pairwise potentials are derived based on their dissimilarity using colour, texture, and location information of the super pixels, i.e.:
[TABLE]
where is the mean colour of the -th super pixel normalised to by , is the centre of mass of the super pixel in pixel coordinates, normalised to with , and is the colour histogram of the -th super pixel whose difference is computed using the Bhattacharyya distance:
[TABLE]
where and are two histograms and is the number of bins in the histograms. This results in a similarity value between [math] and , with [math] encoding identical super pixels. As the constraint function requires a value of for identical super pixels we use the following final pairwise potential function:
[TABLE]
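A histogram comparison of the kind used above can be sketched in Python as follows. Several variants of the Bhattacharyya distance are in use; this sketch assumes the bounded square-root form, sqrt(1 - BC), where BC is the Bhattacharyya coefficient of the two normalised histograms, and it may differ from the exact expression used in the paper. The function name is hypothetical.

```python
import math


def bhattacharyya_distance(h1, h2):
    """Distance in [0, 1] between two non-empty histograms with equal bin counts.

    Returns 0 for identical normalised histograms and 1 for histograms
    that share no occupied bin.
    """
    s1, s2 = sum(h1), sum(h2)
    # Bhattacharyya coefficient of the normalised histograms.
    bc = sum(math.sqrt((a / s1) * (b / s2)) for a, b in zip(h1, h2))
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(0.0, 1.0 - bc))
```

Under this convention a similarity that is largest for identical super pixels can be obtained as one minus the distance, which is one plausible reading of the conversion described above.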
We obtain the global constraints on sets of super pixels by extracting groups of connected points, or objects, from the Velodyne point cloud. This is facilitated by the KITTI dataset providing time synchronised camera images and Velodyne point clouds. To this end we first perform a simple ground plane removal step using RANSAC to find the largest plane aligned with the ground. The remaining points are then grouped using Euclidean distance based clustering [19]. This results in a collection of clusters, of which we only consider those that contain more than points, which ensures that each cluster contains only points belonging to a single class. The ground plane as well as the retained segments of this process can be seen in the accompanying figure (example-constraints). The 3D coordinates of the points contained in the selected clusters are then translated into image space coordinates using the extrinsic calibration provided by the KITTI dataset and then associated with super pixels. Based on this mapping we create the constraint sets used in the optimisation. All super pixels that correspond to the same laser segment are constrained to be assigned the same label. Super pixels which do not belong to any of the extracted laser segments are kept unconstrained.
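The last step of this pipeline, turning projected laser clusters into constraint sets over super pixels, could look roughly like the Python sketch below. It assumes the ground removal, clustering, and projection into the image have already been done. The function name, the argument layout, and the majority vote used when several clusters hit the same super pixel are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter, defaultdict


def build_constraint_sets(point_pixels, point_cluster, superpixel_map, min_points):
    """Group super pixels by the laser cluster that projects onto them.

    point_pixels:   (row, col) image coordinates of the projected laser points
    point_cluster:  cluster id per point, None for ground or unclustered points
    superpixel_map: 2D list holding the super pixel id of every pixel
    min_points:     clusters need more than this many points to be used
    Returns {cluster id: set of super pixel ids}.
    """
    cluster_sizes = Counter(c for c in point_cluster if c is not None)
    height, width = len(superpixel_map), len(superpixel_map[0])

    # Count, per super pixel, how many points of each retained cluster hit it.
    votes = defaultdict(Counter)
    for (row, col), cluster in zip(point_pixels, point_cluster):
        if cluster is None or cluster_sizes[cluster] <= min_points:
            continue
        if 0 <= row < height and 0 <= col < width:
            votes[superpixel_map[row][col]][cluster] += 1

    # Assumption: a super pixel joins the cluster with the most points on it.
    sets = defaultdict(set)
    for superpixel, counts in votes.items():
        best_cluster, _ = counts.most_common(1)[0]
        sets[best_cluster].add(superpixel)

    # A set with a single super pixel imposes no constraint, so drop it.
    return {c: s for c, s in sets.items() if len(s) > 1}
```

Super pixels that appear in none of the returned sets stay unconstrained, matching the behaviour described above.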
IV-B Segmentation Quality
In the following we present image only CRF solutions obtained using loopy belief propagation (LBP) and quadratic programming (QP) to showcase the quality of the results obtained by these methods without using any additional constraints. Thereafter, we introduce the constraints obtained from the Velodyne and compare the results obtained using a graph-cut based HOP method [12] with our hard constraint based method.
Visual Information Only Segmentation
We present results from three methods: (i) the discriminant analysis classifier which provides the unary potentials of the CRF, (ii) loopy belief propagation using the UGM toolbox [20], and (iii) the quadratic programming solution [27]. Exemplary results together with the original image and ground truth labels are shown in the accompanying figure (visual-only-results). The first row shows the original colour images while the second row shows the most likely class of the discriminant analysis classifier which is used as the unary potentials of the CRF. As expected, the classifier output is noisy and incorrect in several places. Both the LBP and QP based CRF solutions produce much cleaner and more consistent results than the raw classifier. However, there are still segmentation errors present due to effects such as shadowing and illumination changes. The quantitative evaluation results from manually labelled images, shown in the method comparison table (method_evaluation), further demonstrate the improvements and also indicate that the QP based solution outperforms the LBP one. This demonstrates that the basis on which our method is built is capable of producing high quality segmentation results before any additional constraints are added, which will be evaluated next.
Laser Constrained Segmentation
In this section we explore the impact additional constraints, extracted from Velodyne data, have on segmentation results by comparing our method to a HOP based method by Kohli et al. [12]. The higher order potentials penalise label inconsistencies between nodes identified to be part of a single segment in the 3D data. Both methods use uniform weight parameters for the unary, pairwise, and higher order potentials, where applicable.
Some exemplary results are shown in the accompanying figure (constraints-based-results) with the original image shown on the far left, followed by the result of the HOP based method in the second column, then our method, and finally the hand labelled ground truth. Inspecting the results we can see that the HOP based method struggles to correctly identify distant objects, especially when cars or walls are involved. Additionally, the results our method obtains appear more uniform with fewer spurious classifications. This difference in behaviour is explained by the way the additional 3D information is used. While our method enforces the constraints, the HOP based method is allowed to violate them. The examples in the accompanying figure (constraint-differences) show the benefit of using hard constraints rather than soft constraints. The first two rows showcase this for a single wall while the third row shows the result of this in a scene populated by pedestrians. The first two columns show the original image and the segment extracted from the Velodyne data. Due to the visual appearance of these areas the classifier fails to pick the correct class in some parts of the 3D segment. The HOP based method fixes some classification errors but cannot fix every single one. In the case of the pedestrian scene the HOP method even misclassifies all pedestrians. Our method on the other hand is forced to assign a single class to the entire segment and as such the correct class is assigned even to the areas where the classifier makes mistakes.
For a quantitative analysis we compute average precision, recall, accuracy, and F1-score for the different methods on labelled images. As we can see in the method comparison table (method_evaluation), the addition of global constraints in our method allows it to significantly outperform the other methods lacking this information, and even the HOP method, using the same information, does not provide the same benefits. This shows that adding constraints based on simple information about which areas belong to a single object allows the segmentation to be more accurate. This is good news, as this type of information is readily available in robotic systems. Looking at the performance of the individual classes in the per-class table (class_evaluation) we can see that the “cyclist & pedestrian” class is the hardest one. This is explained by the fact that instances of this class occur infrequently and as such the classifier has a harder time classifying them correctly. Furthermore, this class has the smallest appearance in the Velodyne data and as such will only be detected at close range. The other classes exhibit similar performance, which is not surprising, given that they occur frequently in the data and cover larger areas of the scene.
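For readers who want to reproduce this kind of evaluation, per-class precision, recall, and F1-score can be computed from paired label lists as in the Python sketch below. The function name and the convention of returning 0.0 when a denominator is empty are assumptions made here; how the paper averages the per-class values is not specified in the text, so no averaging is shown.

```python
def per_class_scores(truth, predicted, classes):
    """Return {class: (precision, recall, f1)} from paired label sequences."""
    scores = {}
    for c in classes:
        pairs = list(zip(truth, predicted))
        tp = sum(t == c and p == c for t, p in pairs)  # correctly labelled c
        fp = sum(t != c and p == c for t, p in pairs)  # wrongly labelled c
        fn = sum(t == c and p != c for t, p in pairs)  # missed instances of c
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        scores[c] = (precision, recall, f1)
    return scores
```

Each element of `truth` and `predicted` would correspond to one super pixel or one pixel, depending on the granularity at which the ground truth is compared.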
The performance of both constrained QP and HOP can be improved by training the weight parameters of the potential functions, which encode knowledge about class relationships and object co-occurrence statistics. The advantage of our method is that it only requires unary and pairwise potentials, while HOP has additional higher order potentials, which can be harder and more time-consuming to learn. This makes the proposed method easier to fine tune as there are fewer parameters involved.
IV-C Runtime Comparison
We start by comparing the runtime required to solve the constrained quadratic program of Eq. (qp-relax-global) directly using NLOPT BOBYQA [11] with that of our proposed framework. As we can see in the accompanying figure (runtime), directly solving the quadratic program is not feasible for problems of interesting size. On the other hand, our method scales very favourably with the problem size. Additionally, while increasing the number of constraints typically makes a problem harder and thus slower to solve, our method becomes faster with more constraints. This is because the constraints reduce the size of the actual problem we solve. This means that adding more domain knowledge allows us to improve the quality of the result as well as speed up the computation.
A typical CRF derived from the images used in the experiments consists of nodes, each of which can have one of seven different labels, which means we have on the order of random variables. Solving this CRF using the quadratic program formulation of Eq. (qp-relax) (with no laser based constraints) takes around while the belief propagation based solution takes . Including the constraints we can reduce the number of nodes to around which results in a much smaller number of variables, around . Solving this problem using the gradient-based method takes around . All computations were performed on an Intel Core i5 3.20 GHz processor with C++ implementations of the algorithms. Besides the reduction of the number of variables involved, our method also requires fewer iterations to converge, around , compared to for the purely image based quadratic program. These two advantages, the reduction in the number of variables and faster convergence, give our method a significant computational advantage.
V Conclusion
In this paper we presented a novel image segmentation method based on a conditional random field with additional global constraints which encode a priori information about groups of nodes having the same label obtained from a secondary sensor. This CRF is formulated as a relaxed quadratic program whose MAP solution is found using gradient-based optimisation. We evaluate our method on data from the KITTI project. Each image is pre-processed into super pixels which provide the unary and pairwise potentials of the CRF. The global constraints on sets of super pixels are obtained from Velodyne data. The results show that the addition of these hard constraints significantly improves on the solution obtained without constraints. Runtime comparisons show that black-box solvers do not scale for this problem and that our formulation exploits constraints in a way which simplifies the problem. Finally, the proposed method is general and capable of encoding other forms of constraints, such as relative positioning of classes with respect to each other.
References
- [1] Superpixel: Empirical studies and applications. http://ttic.uchicago.edu/ xren/research/superpixel/.
- [2] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk. SLIC Superpixels. Technical report, EPFL, 2010.
- [3] A. Aijazi, P. Checchin, and L. Trassoudaine. Segmentation Based Classification of 3D Urban Point Clouds: A Super-Voxel Based Approach with Evaluation. Remote Sensing, 2013.
- [4] Y. Boykov and G. Funka-Lea. Graph Cuts and Efficient ND Image Segmentation. International Journal of Computer Vision, 2006.
- [5] Y. Boykov and M. Jolly. Interactive Graph Cuts for Optimal Boundary & Region Segmentation of Objects in ND Images. In IEEE International Conference on Computer Vision, 2001.
- [6] B. Douillard, D. Fox, and F. Ramos. A Spatio-Temporal Probabilistic Model for Multi-Sensor Object Recognition. In Proc. of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, 2007.
- [7] P. Felzenszwalb and D. Huttenlocher. Efficient Graph-Based Image Segmentation. International Journal of Computer Vision, 2004.
- [8] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets Robotics: The KITTI Dataset. The International Journal of Robotics Research, 2011.
