Mono-Stixels: Monocular depth reconstruction of dynamic street scenes
Fabian Brickwedde, Steffen Abraham, Rudolf Mester

TL;DR
Mono-stixels introduce a monocular approach to reconstruct dynamic street scenes by jointly estimating depth, motion, and semantics, overcoming static scene limitations of traditional methods.
Contribution
This paper presents mono-stixels, a novel monocular method that estimates depth, motion, and semantics simultaneously for dynamic scenes using 1D energy minimization.
Findings
Reliable depth reconstruction of static and moving objects
Outperforms static scene methods on KITTI 2015
Compact representation suitable for real-time applications
Table I: Stixel types with their associated semantic classes, geometry and motion.

| Stixel type | Semantic classes | Geometry | Motion |
|---|---|---|---|
| ground | road, sidewalk, terrain | lying | static |
| static object | building, poles/signage, vegetation | upright | static |
| dynamic object | vehicle, two-wheeler, person | upright | potentially moving |
| sky | sky | infinite distance | static |

| Metric | Interpretation |
|---|---|
| RMSE | lower is better |
| Rel. Error | lower is better |
| Threshold (% of estimates that satisfy the threshold criterion) | higher is better |
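
Table II: Depth evaluation of the structure-from-motion baseline and mono-stixels on KITTI 2015.

| Scene part | Metric | Baseline: SFM [8] | Ours: Mono-Stixels |
|---|---|---|---|
| | Compactness | 465 k | 5.69 k |
| Overall | Rel. Error [%] | 35.57 | 12.30 |
| Overall | RMSE [m] | 11.29 | 6.29 |
| Overall | Threshold [%] | 66.59 | 70.61 |
| Overall | Threshold [%] | 76.98 | 88.95 |
| Overall | Threshold [%] | 82.90 | 95.83 |
| Overall | Threshold [%] | 87.44 | 97.73 |
| Static | Rel. Error [%] | 17.08 | 11.22 |
| Static | RMSE [m] | 7.93 | 6.31 |
| Static | Threshold [%] | 76.18 | 76.03 |
| Static | Threshold [%] | 87.57 | 90.42 |
| Static | Threshold [%] | 93.19 | 95.95 |
| Static | Threshold [%] | 95.65 | 97.73 |
| Moving | Rel. Error [%] | 158.07 | 19.40 |
| Moving | RMSE [m] | 23.56 | 6.16 |
| Moving | Threshold [%] | 3.16 | 34.69 |
| Moving | Threshold [%] | 6.94 | 79.20 |
| Moving | Threshold [%] | 14.75 | 95.04 |
| Moving | Threshold [%] | 33.10 | 97.72 |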
Fabian Brickwedde (1, 2), Steffen Abraham (1), Rudolf Mester (2)
(1) Robert Bosch GmbH, Hildesheim, Germany
(2) VSI Lab, Goethe University, Frankfurt, Germany
Abstract
In this paper we present mono-stixels, a compact environment representation specially designed for dynamic street scenes. Mono-stixels are a novel approach to estimate stixels from a monocular camera sequence instead of the traditionally used stereo depth measurements.
Our approach jointly infers the depth, motion and semantic information of the dynamic scene as a 1D energy minimization problem based on optical flow estimates, pixel-wise semantic segmentation and camera motion. The optical flow of a stixel is described by a homography. By applying the mono-stixel model the degrees of freedom of a stixel-homography are reduced to only up to two degrees of freedom. Furthermore, we exploit a scene model and semantic information to handle moving objects.
In our experiments we use the publicly available DeepFlow for optical flow estimation and FCN8s for the semantic information as inputs and show on the KITTI 2015 dataset that mono-stixels provide a compact and reliable depth reconstruction of both the static and moving parts of the scene. Thereby, mono-stixels overcome the limitation to static scenes of previous structure-from-motion approaches.
©2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. DOI: 10.1109/ICRA.2018.8460490
I INTRODUCTION
Autonomous vehicles and advanced driver assistance systems need to understand the surrounding environment to identify critical objects, parking slots or navigate through the street scene. Therefore, a representation of the geometric and semantic layout of the street scene is necessary.
One useful compact medium-level representation for street scenes is the so-called stixel world, which was introduced by Badino et al. [1] and extended to a multi-layer stixel world by Pfeiffer and Franke [2]. The stixel world is defined as a column-wise segmentation of the image into thin, stick-like planar elements, the stixels. The underlying world model distinguishes three types of stixels: lying ground stixels, upright object stixels and sky stixels at infinite distance. Thus, the geometric layout of a stixel can be described by a single depth value, resulting in a quite compact representation. Furthermore, Schneider et al. [3] introduced semantic stixels, which additionally carry a semantic label such as road, vegetation or vehicle. Many works show that this medium-level representation is suitable for high-level vision tasks such as object segmentation [4], object tracking [5] or region-of-interest selection for pedestrian detection [6]. Even autonomous driving based on the stixel world seems to be possible, as shown by Franke et al. [7].
The mentioned works build on a dense disparity map from a stereo sensor. In general a disparity can be seen as a scaled value for the inverse depth which is also derivable from the optical flow induced by a single moving camera. However, this structure-from-motion principle does not hold for moving objects as shown in Fig. 1 and thus the mentioned works are not applicable for a monocamera in dynamic scenes.
Therefore, this work introduces mono-stixels, a compact environment representation derived from a monocular camera sequence. Inputs for our methods are a dense optical flow field, pixel-wise semantic segmentation and camera motion estimation. Mono-stixels are especially designed to handle static and moving parts in a joint optimization resulting in a reliable depth reconstruction of the whole dynamic scene as shown in Fig. 1.
II RELATED WORK
We see three categories of related work. The first category consists of monocular depth estimation based on the structure-from-motion or multi-view geometry principle [8]. Notable approaches are PMO [9], DTAM [10], LSD-SLAM [11], ORB-SLAM [12] and REMODE [13]. Based on the optical flow, the camera motion is estimated and the depth of the environment is derived using multi-view geometry. However, this principle only holds for static scenes and none of these methods can handle moving objects. Klappstein [14] proposed a way to detect moving objects based on the optical flow and camera motion. Static points in the scene have to fulfill the epipolar geometry, the positive depth constraint and the positive height constraint. But there are still some epipolar conformant independently moving objects (IMOs), such as oncoming vehicles, that are not detectable. Ranftl et al. [15] distinguished objects in the scene by their fundamental matrix and proposed to reconstruct the points for each fundamental matrix individually. In a second step the different reconstructions are scaled such that they are connected, e.g. such that a moving vehicle stands on the ground plane. However, epipolar conformant IMOs are represented by the same fundamental matrix as the static scene. Thus, these objects are not distinguishable and the reconstruction fails. Consequently, current structure-from-motion based approaches are limited to static scenes and IMOs that violate the epipolar constraint.
The second category comprises methods that provide a pixel-wise semantic segmentation or leverage this kind of information for a different vision task. The Cityscapes dataset [16] gives an overview of the performance of pixel-wise semantic segmentation methods in street scenes and shows that deep neural networks enable semantic segmentation at an unprecedented level of performance. One of the pioneering works is the fully convolutional network (FCN) introduced by Long et al. [17]. Semantics can be useful information for different vision tasks. For example, Bai et al. [18] use the semantic information to distinguish static objects from potentially moving traffic participants for optical flow estimation. Furthermore, Schneider et al. [3] showed that the semantic information can support the stereo-based stixel estimation, resulting in a higher depth accuracy.
The last category of related work comprises the monocamera-based stixel estimation methods. Wolcott and Eustice [19] proposed a column-wise partitioning of the image into ground, obstacle and background based on a prior appearance ground map and optical flow. Levi et al. [20] introduced the Stixel-Net, a convolutional neural network for the detection of the ground contact point of the first obstacle in each column. However, neither method provides a multi-layer stixel world including a depth reconstruction of the whole scene. Current monocamera-based stixel estimation methods are more closely related to free space estimation or road segmentation methods.
There are two main contributions provided by the mono-stixels approach presented here. First, mono-stixels are a novel approach to estimate a multi-layer stixel world from a monocular camera sequence, providing a depth reconstruction of the whole dynamic scene. Second, mono-stixels exploit semantic information and scene constraints for a joint monocular depth estimation of the static and moving parts of the street scene. Thereby, mono-stixels provide reliable depth estimates even for the epipolar conformant IMOs, a novelty compared to previous structure-from-motion approaches.
III METHOD
In this section, we describe our stixel estimation method. In the first subsection we define the mono-stixels model and the segmentation problem, which is formulated as an energy minimization problem in the second subsection. Finally, the last subsection describes the inference via dynamic programming.
Our method takes a dense optical flow field, a pixel-wise semantic segmentation and the camera motion as inputs. The dense optical flow field is defined as the motion of each image point from the current image to a previous image, including a confidence measure. If the confidence is not provided by the optical flow algorithm itself, it can be estimated based on the structure tensor as described in [21]. For the semantic information, we require that a pixel-wise semantic segmentation algorithm provides for each pixel the pseudo-probabilities (scores) of belonging to the semantic classes listed in Table I.
III-A Mono-Stixels
Mono-stixels are defined as thin, stick-like, planar and rigidly moving elements in the scene. To reduce the segmentation problem to a 1D optimization problem, the image of width $w$ and height $h$ is divided into columns of a fixed width $w_s$, and the segmentation problem is formulated and solved independently for each column as in [2].

$$\mathbf{s} = (s_1, \dots, s_N), \qquad s_i = (v_i^b,\, v_i^t,\, c_i,\, p_i,\, \rho_i,\, \mathbf{t}_i) \tag{1}$$

The vector $\mathbf{s}$ represents the segmentation of the column into $N$ mono-stixels, where each mono-stixel $s_i$ is defined by its bottom and top image coordinates $v_i^b$ and $v_i^t$, its semantic class $c_i$, its stixel type $p_i$, its inverse depth $\rho_i$ and its motion $\mathbf{t}_i$. Reflecting the characteristics of traffic participants, the motion is defined as a 2D translation of the stixel on the ground plane. The rotational motion is neglected due to the small horizontal extent of a stixel.
We define four stixel types that are solely distinguishable by their geometry and motion (see Table I). Furthermore, we associate each semantic class to exactly one stixel type. Thereby, we leverage the semantic segmentation to prefer a specific stixel type. For example, a high score for the vehicle class prefers a dynamic object stixel.
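The association between semantic classes and stixel types can be written down as a small lookup table. The following is a minimal sketch that copies the class names from Table I; the names `StixelType`, `CLASSES_BY_TYPE`, `TYPE_OF_CLASS` and `preferred_type` are hypothetical and only illustrate how a high class score prefers one stixel type.

```python
from enum import Enum


class StixelType(Enum):
    GROUND = "ground"
    STATIC_OBJECT = "static object"
    DYNAMIC_OBJECT = "dynamic object"
    SKY = "sky"


# Semantic classes per stixel type, copied from Table I.
CLASSES_BY_TYPE = {
    StixelType.GROUND: ("road", "sidewalk", "terrain"),
    StixelType.STATIC_OBJECT: ("building", "poles/signage", "vegetation"),
    StixelType.DYNAMIC_OBJECT: ("vehicle", "two-wheeler", "person"),
    StixelType.SKY: ("sky",),
}

# Inverse lookup: every semantic class belongs to exactly one stixel type.
TYPE_OF_CLASS = {c: t for t, classes in CLASSES_BY_TYPE.items() for c in classes}


def preferred_type(scores):
    """Return the stixel type associated with the highest-scoring class.

    `scores` maps class names to pseudo-probabilities for one pixel or segment.
    """
    best_class = max(scores, key=scores.get)
    return TYPE_OF_CLASS[best_class]
```

For example, `preferred_type({"vehicle": 0.7, "road": 0.2, "building": 0.1})` yields `StixelType.DYNAMIC_OBJECT`. In the actual method this preference is only one term of the energy and can be overruled by the optical flow and the structural prior.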
III-B Energy minimization problem
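The segmentation problem of one column in Eq. (1) is formulated as an energy minimization problem. The unary energy term $\Phi(\mathbf{s}, \mathbf{f}, \mathbf{l})$ captures a data likelihood depending on the optical flow $\mathbf{f}$ and the pixel-wise semantic segmentation $\mathbf{l}$. Additionally, a pairwise prior term $\Psi(\mathbf{s})$ incorporates prior knowledge about the typical structure of street scenes as in [2].

$$\hat{\mathbf{s}} = \arg\min_{\mathbf{s}} E(\mathbf{s}, \mathbf{f}, \mathbf{l}), \qquad E(\mathbf{s}, \mathbf{f}, \mathbf{l}) = \Phi(\mathbf{s}, \mathbf{f}, \mathbf{l}) + \Psi(\mathbf{s}) \tag{2}$$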
The prior term prefers a stixel segmentation with a geometric layout likely for the typical structure of street scenes and further regulates the model complexity by adding a constant value for each new stixel.
[Eq. (3)]
We follow the structural prior term proposed by Pfeiffer and Franke [2] in a slightly different definition. First, the gravity constraint prefers objects standing on the ground plane. Thus, if an object stixel succeeds a ground or sky stixel, the structural energy term is defined by the gravity term
[Eq. (4)]
where the two coefficients are tunable parameters that express the prior knowledge and the argument is the height of the stixel over the reference ground, which is defined by the ground stixels in that column. If there is no ground stixel, the reference ground is defined by the height of the camera mounting position on the vehicle.
However, if the previous stixel is also an object stixel, the bottom of the stixel might not be the bottom of the object due to occlusion or a depth discontinuity inside the object. Therefore, in this case the structural prior term is defined as the minimum of the gravity term and an ordering constraint
[Eq. (5)]
whose coefficients are again tunable parameters.
Furthermore, we prefer small discontinuities in the height of the ground plane, e.g. caused by a slanted street or a sidewalk. Thus, for ground stixels the structural term is defined by
[Eq. (6)]
where the coefficients are tunable parameters and the argument is the height difference between the ground stixel and the reference ground as in Eq. (4).
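The unary term or data likelihood rates the consistency of an individual stixel hypothesis based on the semantic segmentation $\mathbf{l}$ and the optical flow field $\mathbf{f}$. The data likelihoods are modeled to be independent across the pixels and therefore independent across the rows $v$ in that column.

$$\Phi(\mathbf{s}, \mathbf{f}, \mathbf{l}) = \sum_{i=1}^{N} \sum_{v = v_i^b}^{v_i^t} \Big( \delta_F\, \Phi_F(s_i, \mathbf{f}_v, v) + \delta_L\, \Phi_L(s_i, \mathbf{l}_v) \Big) \tag{7}$$

where $\delta_F$ and $\delta_L$ weight the data likelihood of the optical flow and the semantic segmentation, respectively.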
The data likelihood of the semantic segmentation prefers stixel hypotheses whose semantic class has high class scores inside the stixel segment,
[Eq. (8)]
where a constant accounts for the fact that the class score might not be reliable in some cases.
Note that stixels of a given type can only have one of the associated semantic classes defined in Table I. Thus, a high class score prefers both the corresponding semantic class and the stixel type associated with this class.
Analogously, the optical flow term rates the consistency of the optical flow for an individual stixel hypothesis. Because a stixel is defined as a planar part of the scene, the expected optical flow of a pixel can be described by a homography $\mathbf{H}$ for a given stixel hypothesis [8]:

$$\mathbf{x}' \simeq \mathbf{H}\,\mathbf{x}, \qquad \mathbf{H} = \mathbf{K}\left(\mathbf{R} - \rho\,\mathbf{t}\,\mathbf{n}^{\top}\right)\mathbf{K}^{-1} \tag{9}$$

Here $\mathbf{x}$ and $\mathbf{x}'$ are the homogeneous coordinates of the pixel in the current and the previous image, $\mathbf{K}$ is the intrinsic calibration matrix and $\rho$ is the inverse depth. The normal vector $\mathbf{n}$ is defined by the geometric definition of the stixel type as lying or upright (see Table I) and the assumption that the stixel is facing the camera center. Furthermore, for static stixel types the rotation matrix $\mathbf{R}$ and the translation vector $\mathbf{t}$ are solely defined by the camera motion. For sky stixels, there is the special case that the inverse depth is zero, which simplifies the homography to $\mathbf{H} = \mathbf{K}\mathbf{R}\mathbf{K}^{-1}$.
In contrast, for dynamic stixels the translational motion of the stixel itself also has to be taken into account. However, the expected optical flow is still describable by a homography, with the translation vector $\mathbf{t}$ defined as the relative translation between the camera and the stixel hypothesis. Thereby, the homography serves as the common description of the optical flow for static and dynamic stixels in a monocular camera setup.
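To make the homography-based flow model concrete, the following minimal sketch computes the expected flow of one pixel for an upright stixel and for a sky stixel. It assumes the plane-induced homography of [8] with the convention that a point X in the current camera frame maps to R X + t in the previous frame and that the normal points towards the camera. The function names, the calibration matrix and the numeric values are assumptions for illustration, not taken from the paper.

```python
import numpy as np


def stixel_homography(K, R, t, n, rho):
    """Plane-induced homography H = K (R - rho * t n^T) K^-1.

    Assumed convention: X_prev = R @ X_cur + t, n is the unit plane normal
    pointing towards the camera, rho is the inverse distance of the plane.
    """
    return K @ (R - rho * np.outer(t, n)) @ np.linalg.inv(K)


def expected_flow(H, x):
    """Expected optical flow of pixel x = (u, v) under homography H."""
    xh = H @ np.array([x[0], x[1], 1.0])
    return xh[:2] / xh[2] - np.asarray(x, dtype=float)


# Hypothetical pinhole camera and motion.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                     # no rotation between the two frames
t = np.array([0.0, 0.0, 1.0])     # points are 1 m farther away in the previous frame
n = np.array([0.0, 0.0, -1.0])    # upright stixel facing the camera

# Upright stixel at 10 m depth (rho = 0.1) and sky stixel (rho = 0).
flow_object = expected_flow(stixel_homography(K, R, t, n, rho=0.1), (420.0, 300.0))
flow_sky = expected_flow(stixel_homography(K, R, t, n, rho=0.0), (420.0, 100.0))
```

With a pure forward motion and no rotation, the sky stixel shows zero flow, while the object stixel shows a flow towards the focus of expansion that scales with its inverse depth. For a dynamic stixel the same code applies with `t` replaced by the relative translation between camera and stixel.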
Pfeiffer and Franke [2] proposed to define the data likelihood for stereo depth measurements based on the difference between expected and measured disparity. Similar to this approach, we define our data likelihood based on the residual flow, i.e. the difference between the measured and the expected optical flow. To this end, we define a measurement model of the optical flow estimates as a mixture model consisting of a normal distribution with a given covariance for inliers and a broad uniform distribution for outliers. Switching to the log-domain and summing up the constant parts in one parameter, the following energy term is derived:
[Eq. (10)]
III-C Solving the mono-stixels segmentation problem
The previous sections describe the mono-stixels segmentation problem (Eq. (1)) for one column as a 1D energy minimization problem (Eq. (2)). This 1D energy minimization problem is solvable via dynamic programming, e.g. by using the Viterbi algorithm. However, even with dynamic programming the run time grows quadratically with the number of possible labels for a stixel hypothesis, which results in a high computational effort.
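The structure of such a column-wise dynamic program can be sketched as follows. This is a generic minimum-cost segmentation, not the authors' implementation: it assumes that the unary cost is given per row and stixel type, that the pairwise cost depends only on the types of neighbouring stixels, and that each stixel adds a constant model-complexity cost. The function name and all arguments are hypothetical.

```python
import numpy as np


def segment_column(unary, pairwise, new_stixel_cost):
    """Minimum-cost segmentation of one column via dynamic programming.

    unary[p, v]    : cost of assigning type p to row v (P types, h rows)
    pairwise[q, p] : cost of a stixel of type p following one of type q
    Returns a list of (first_row, last_row, type) with inclusive rows.
    """
    P, h = unary.shape
    # Prefix sums make the cost of any row range an O(1) lookup.
    prefix = np.concatenate([np.zeros((P, 1)), np.cumsum(unary, axis=1)], axis=1)
    # best[p, e]: minimal cost of rows 0..e when the last stixel has type p.
    best = np.full((P, h), np.inf)
    back = {}
    for e in range(h):
        for p in range(P):
            for s in range(e + 1):
                seg = prefix[p, e + 1] - prefix[p, s] + new_stixel_cost
                if s == 0:
                    cand, prev = seg, None
                else:
                    q = int(np.argmin(best[:, s - 1] + pairwise[:, p]))
                    cand = best[q, s - 1] + pairwise[q, p] + seg
                    prev = (q, s - 1)
                if cand < best[p, e]:
                    best[p, e] = cand
                    back[(p, e)] = (s, prev)
    # Backtrack from the cheapest final state.
    state = (int(np.argmin(best[:, h - 1])), h - 1)
    segments = []
    while state is not None:
        p, e = state
        s, state = back[(p, e)]
        segments.append((s, e, p))
    return segments[::-1]


# Toy column: type 0 fits rows 0..2, type 1 fits rows 3..5.
toy_unary = np.array([[0.0, 0.0, 0.0, 5.0, 5.0, 5.0],
                      [5.0, 5.0, 5.0, 0.0, 0.0, 0.0]])
toy_segments = segment_column(toy_unary, np.zeros((2, 2)), new_stixel_cost=1.0)
```

The triple loop over end row, type and start row shows why the cost grows quickly once further labels such as depth and motion are added to the state, which motivates approximating those labels per stixel hypothesis instead of optimizing them jointly.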
Therefore, we follow the simplification to a minimum path problem proposed in [22]. Only the stixel types and segmentation labels are optimized via dynamic programming, while the other labels are approximated. Approximation in this context means finding the inverse depth $\rho_i$, the semantic class $c_i$ and the motion label $\mathbf{t}_i$ for one stixel hypothesis given its segmentation $(v_i^b, v_i^t)$ and type $p_i$.
As the semantic class label we take the class associated with the stixel type that has the highest class scores in the corresponding image segment:
[Eq. (11)]
For the inverse depth and motion label, in the first step we optimize the optical flow related energy term $\Phi_F$. That means we need to find the best homography for the corresponding image segment, considering the geometry and motion definitions of the stixel type:

$$\hat{\mathbf{H}}_i = \arg\min_{\mathbf{H}} \sum_{v = v_i^b}^{v_i^t} \Phi_F(\mathbf{H}, \mathbf{f}_v, v) \tag{12}$$
In general, a homography has eight degrees of freedom. However, our stixel model allows several constraints to be applied that reduce the degrees of freedom. $\mathbf{K}$, $\mathbf{R}$ and $\mathbf{n}$ are defined by the intrinsic calibration, camera motion or stixel type as discussed for Eq. (9), and only the inverse depth $\rho$ and the translation $\mathbf{t}$ are free parameters. Assuming a translational motion of the camera and stixel on the ground plane, the translation vector of the homography can be described by two degrees of freedom. Thus, because the inverse depth and the translation enter Eq. (9) only as the product $\rho\,\mathbf{t}$, the homography has two degrees of freedom, namely the two components of the scaled 2D translation on the ground plane, which a rotation maps from the ground plane into the camera coordinate system. Furthermore, for static object and ground stixels the translation is solely defined by the camera motion, which means that only the inverse depth is left as one degree of freedom of the homography.
Exploiting the mono-stixel model shows that a stixel homography has at most two degrees of freedom for a given stixel type. This means that one optical flow vector is enough to derive a stixel homography by solving a system in one unknown for a static stixel and in two unknowns for a dynamic stixel. These systems are obtained by rearranging the homography relation between corresponding pixels.
To find the best homography for the whole image segment as defined in Eq. (12), we use an MLESAC-based [23] approach:
1. Take one optical flow vector to compute a hypothesis for the stixel homography.
2. Compute the optical flow related energy term for this homography.
3. Repeat from step 1 until all optical flow vectors are used.
4. Choose the homography with the lowest energy.
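The four steps can be sketched for a static stixel, where the inverse depth is the only unknown. This is an illustrative sketch under stated assumptions: the homography uses the plane-induced form of [8] with X_prev = R X_cur + t, the one-vector hypothesis is obtained by a least-squares solution of the cross-product constraint, and the energy is a truncated squared residual standing in for the paper's flow energy term, whose exact form is not reproduced here. All function names and the parameters `sigma` and `tau` are hypothetical.

```python
import numpy as np


def homography(K, R, t, n, rho):
    """Plane-induced homography H = K (R - rho * t n^T) K^-1 (assumed convention)."""
    return K @ (R - rho * np.outer(t, n)) @ np.linalg.inv(K)


def rho_from_one_flow(K, R, t, n, x, flow):
    """Step 1: inverse depth hypothesis from a single flow vector.

    Uses x' ~ a - rho * b with a = K R K^-1 x and b = K t (n^T K^-1 x),
    so that rho * (x' cross b) = x' cross a, solved in the least-squares sense.
    Returns None if the flow vector carries no depth information.
    """
    Kinv = np.linalg.inv(K)
    xh = np.array([x[0], x[1], 1.0])
    xp = np.array([x[0] + flow[0], x[1] + flow[1], 1.0])
    a = K @ R @ Kinv @ xh
    b = (K @ t) * (n @ Kinv @ xh)
    ca, cb = np.cross(xp, a), np.cross(xp, b)
    denom = cb @ cb
    if denom < 1e-12:
        return None
    return float(ca @ cb / denom)


def flow_energy(H, pixels, flows, sigma=1.0, tau=9.0):
    """Step 2: truncated squared flow residuals (a stand-in robust energy)."""
    ones = np.ones((len(pixels), 1))
    warped = (H @ np.hstack([pixels, ones]).T).T
    predicted = warped[:, :2] / warped[:, 2:3] - pixels
    r2 = np.sum((flows - predicted) ** 2, axis=1) / sigma**2
    return float(np.sum(np.minimum(r2, tau)))


def best_static_rho(K, R, t, n, pixels, flows):
    """Steps 3 and 4: try every flow vector, keep the lowest-energy hypothesis."""
    best_energy, best_rho = np.inf, None
    for x, f in zip(pixels, flows):
        rho = rho_from_one_flow(K, R, t, n, x, f)
        if rho is None:
            continue
        energy = flow_energy(homography(K, R, t, n, rho), pixels, flows)
        if energy < best_energy:
            best_energy, best_rho = energy, rho
    return best_rho
```

Because every flow vector in the segment is used as a hypothesis exactly once, no random sampling is needed, and a single gross outlier only contributes the truncation value `tau` to the energy of the correct hypothesis. For a dynamic stixel the same scheme applies with two unknowns per hypothesis.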
For static object and ground stixels, this directly defines the inverse depth label of the stixel, since the inverse depth is the single degree of freedom of the homography.
In contrast, for dynamic object stixels only the product of the inverse depth and the translation is defined by the homography, and either the inverse depth or one component of the translation vector can still be chosen freely. Based on our scene model it is possible to choose a plausible inverse depth by taking the one that minimizes the structural prior term. If the previous stixel is a ground stixel, this is achieved by the inverse depth that makes the stixel stand on the ground stixel. If the previous stixel is an object stixel, the inverse depth is less clear, as the ordering constraint is zero for every stixel behind the previous one. In this case we take the same depth as the previous stixel, which might be roughly correct if both stixels belong to the same object.
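The idea of fixing the depth of a dynamic stixel by letting it stand on the ground can be illustrated with a simple flat-ground computation. This sketch assumes a pinhole camera with zero pitch and roll, an image row axis pointing downwards, and a flat reference ground at the camera mounting height; the paper instead uses the ground stixel below the object. The function name and all numeric values are hypothetical.

```python
def inverse_depth_on_ground(v_bottom, f_y, c_y, camera_height):
    """Inverse depth of a ground contact point on a flat ground plane.

    v_bottom      : image row of the stixel bottom (rows grow downwards)
    f_y, c_y      : focal length and principal point row in pixels
    camera_height : height of the camera above the ground in metres
    A ground point at depth Z projects to v = c_y + f_y * camera_height / Z,
    hence 1 / Z = (v - c_y) / (f_y * camera_height).
    """
    dv = v_bottom - c_y
    if dv <= 0:
        raise ValueError("stixel bottom lies at or above the horizon")
    return dv / (f_y * camera_height)


# Hypothetical camera: f_y = 700 px, horizon at row 180, mounted 1.65 m high.
rho_vehicle = inverse_depth_on_ground(v_bottom=250.0, f_y=700.0, c_y=180.0,
                                      camera_height=1.65)
depth_vehicle = 1.0 / rho_vehicle
```

Once the inverse depth is fixed this way, the translation of the dynamic stixel follows from the estimated product of inverse depth and translation.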
Note that, depending on the application, it might not be required to choose a certain value for the inverse depth. The energy terms excluding the structural prior term are still defined by the two free parameters of the homography, and these values are able to represent the time to contact in the longitudinal and lateral direction with the camera as reference point. This might be enough for a criticality analysis.
Based on these approximations all labels are defined during the inference via dynamic programming. The achieved run time of this optimization is for one column and thereby for all columns for an image of width and height .
IV EXPERIMENTS
In this section we give an in-depth analysis of the performance of our proposed mono-stixels. First, we describe the setup of our experiments, comprising the dataset and inputs used. Second, we define a metric and a baseline. Finally, we present an evaluation based on this metric and show some example results of our proposed mono-stixels.
IV-A Setup
We evaluate our approach on the KITTI-Stereo’15 [25] dataset, which contains 200 images captured from a forward-facing camera in different street scenes. The dataset provides sparse ground truth depth obtained from a Velodyne HDL-64 laser scanner and 3D-CAD models for moving vehicles. Furthermore, static and moving objects are distinguishable based on the scene flow, which enables a more differentiated evaluation.
As mentioned, our method assumes a dense optical flow field, camera motion and a pixel-wise semantic segmentation as inputs. In our experiments the camera motion is provided by the visual odometry method in [26]. The dense optical flow field is estimated by the publicly available DeepFlow [24]. According to the KITTI optical flow benchmark [25], there are also real-time capable dense optical flow methods with comparable or even better performance than DeepFlow. The optical flow is estimated on keyframes that are separated by a minimum driven distance. These keyframes do not exist for all images in the dataset; thus, the evaluation is done on 171 of the 200 images of the KITTI-Stereo’15 dataset.
For the pixel-wise semantic segmentation we train our own FCN8s network [17] based on the VGG architecture [27]. We follow the training proposed in [3]. First, we train our net on the Cityscapes dataset [16]. Second, we fine-tune this net on 470 images of the KITTI dataset collected from the labeled subsets in [28], [29], [30], [31], [32], [33]. Furthermore, we unified the training data to the 10 classes mentioned in Table I.
IV-B Metric and Baseline
To evaluate our method we follow the metric proposed in [34]. The metrics are computed over the set of ground truth depth values and the corresponding estimated depth at each of these positions.
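The three reported error measures (RMSE, relative error and threshold accuracy) can be computed as in the following sketch. It uses the definitions that are common in the monocular depth estimation literature; the threshold values are assumptions, since the ones used in the paper are not given in this copy, and the function name is hypothetical.

```python
import numpy as np


def depth_metrics(d_est, d_gt, thresholds=(1.25, 1.25 ** 2, 1.25 ** 3)):
    """RMSE, mean relative error and threshold accuracies for depth estimates.

    d_est, d_gt : estimated and ground truth depths at the same positions
    thresholds  : assumed limits for max(d_est / d_gt, d_gt / d_est)
    Returns (rmse, rel_error, accuracies), accuracies as fractions in [0, 1].
    """
    d_est = np.asarray(d_est, dtype=float)
    d_gt = np.asarray(d_gt, dtype=float)
    rmse = float(np.sqrt(np.mean((d_est - d_gt) ** 2)))
    rel_error = float(np.mean(np.abs(d_est - d_gt) / d_gt))
    ratio = np.maximum(d_est / d_gt, d_gt / d_est)
    accuracies = [float(np.mean(ratio < thr)) for thr in thresholds]
    return rmse, rel_error, accuracies


# Three hypothetical measurements in metres.
rmse, rel_error, accuracies = depth_metrics([10.0, 20.0, 30.0], [10.0, 25.0, 20.0])
```

RMSE weights distant points more strongly because absolute errors grow with depth, whereas the relative error and the threshold accuracy are normalized by the ground truth depth.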
Additionally, we indicate the compactness by the number of values needed to represent the depth of the environment. Three values per mono-stixel are needed to describe the depth of the scene: one for the segmentation, one for the stixel type and one for the inverse depth. Thus, the compactness is defined as three times the mean number of mono-stixels per image. For a dense depth map the compactness is equal to the number of pixels per image.
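The compactness definition amounts to simple arithmetic, shown in the sketch below. The image size of 1242 x 375 pixels is an assumption (a typical KITTI resolution that is consistent with the reported 465 k values for a dense depth map), and the function names are hypothetical.

```python
def stixel_compactness(stixels_per_image):
    """Three values (segmentation, type, inverse depth) per mono-stixel,
    averaged over all images."""
    return 3 * sum(stixels_per_image) / len(stixels_per_image)


def dense_compactness(width, height):
    """A dense depth map stores one value per pixel."""
    return width * height


# Assumed KITTI image size; 5.69 k is the mono-stixel compactness from Table II.
dense_values = dense_compactness(1242, 375)
reduction_factor = dense_values / 5690.0
```

Under this assumption the dense map needs 465,750 values, about 82 times the 5.69 k values reported for mono-stixels, which matches the statement that roughly one eightieth of the values is needed.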
As described in the related work section, the structure-from-motion approaches are still limited to static scenes or IMOs that violate the epipolar constraint. Our experiments should clearly show that our method provides a reliable depth reconstruction of moving objects. Therefore, we separately evaluate the static and moving parts of the scene. Note that many objects in street scenes are epipolar conformant IMOs and it is essential to handle these objects to produce good results. Furthermore, we implement a traditional structure-from-motion method as a baseline by performing a triangulation [8] for each image point based on a dense optical flow field and camera motion. As inputs of this method we use exactly the same dense optical flow and camera motion estimation as for our mono-stixels. Thus, the comparison to this baseline directly shows the impact of the mono-stixels and is not affected by the performance of the optical flow or camera motion estimation methods used.
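A per-pixel triangulation of this kind can be sketched with the linear (DLT) method from [8]. The paper does not state which triangulation variant the baseline uses, so the choice of the linear method is an assumption, as are the function name, the camera matrices and the example point.

```python
import numpy as np


def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point from two views.

    P1, P2 : 3x4 projection matrices of the two camera poses
    x1, x2 : corresponding pixel coordinates (u, v) in the two images
    Returns the 3D point in the coordinate frame of the projection matrices.
    """
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The homogeneous solution is the right singular vector that belongs
    # to the smallest singular value of A.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]


def project(P, X):
    """Project a 3D point with a 3x4 projection matrix."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]


# Hypothetical camera and motion between the current and the previous frame.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
P_cur = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P_prev = K @ np.hstack([np.eye(3), np.array([[0.3], [0.0], [1.0]])])

X_true = np.array([2.0, 1.2, 10.0])
X_rec = triangulate(P_cur, P_prev, project(P_cur, X_true), project(P_prev, X_true))
```

For a static point the reconstruction is exact up to numerical precision. If the point itself moves between the two frames, the second projection no longer matches the camera motion alone and the triangulated depth becomes wrong, which is the failure mode of the baseline on moving objects.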
IV-C Results
Fig. 2 shows some example outputs of the proposed mono-stixels using a fixed stixel width and hand-tuned parameters. Please note that we additionally provide a video consisting of a 3D visualization in the supplementary material.
These examples show that mono-stixels provide a compact and plausible reconstruction of the static as well as the dynamic parts of the scene. The performance of mono-stixels compared to the structure-from-motion baseline is shown in Fig. 3, including different moving objects such as oncoming, preceding or crossing vehicles. The examples show that the structure-from-motion baseline fails for all moving objects, while the mono-stixels provide a reliable depth reconstruction for all moving objects. This is also shown in the quantitative evaluation in Table II. Mono-stixels show a comparable and even more robust performance for the static parts compared to the baseline. However, only the mono-stixels are also able to provide reliable depth estimates for the moving objects. Furthermore, only one eightieth of the values is needed to represent the depth of the scene.
We also show failure cases in the examples in Fig. 2. First, the mono-stixels depend on the performance of the optical flow algorithm used. In the first example, the optical flow estimation fails for the upper part of the pole on the left side, which results in wrong depth estimates for that part. Furthermore, the last example shows that the estimation might fail for scenes that violate the world assumptions. In that example the steep slope of the grass in the right part violates the assumption of a flat ground plane, which results in a high number of segments in the lower part and wrong depth estimates in the upper part. Furthermore, a slanted ground plane would violate our assumption regarding the translational motion of a moving object. However, this should be solvable by applying the concept of slanted stixels [35], which are specially designed to represent steep slopes in the ground plane.
V CONCLUSIONS
This paper presented mono-stixels, a novel stixel estimation method for monocular camera sequences. The homography-based formulation allows the optical flow of static and dynamic stixels to be described in a common way for a joint optimization. Furthermore, we showed how to leverage a pixel-wise semantic label to distinguish static and potentially moving objects, and how to use a scene and ground plane model, especially for the depth reconstruction of moving objects, in this stixel estimation method.
Many works have already shown the suitability of the medium-level representation with stixels as primitive elements for high-level vision tasks. Thus, the mono-stixels approach could be the enabler for using the stixel world in a monocamera setup for driver assistance systems or autonomous vehicles.
REFERENCES
[1] H. Badino, U. Franke, and D. Pfeiffer, “The Stixel World - A Compact Medium Level Representation of the 3D-World,” in Joint Pattern Recognition Symposium. Springer, 2009, pp. 51–60.
[2] D. Pfeiffer and U. Franke, “Towards a Global Optimal Multi-Layer Stixel Representation of Dense 3D Data,” in BMVC, vol. 11, 2011, pp. 51–1.
[3] L. Schneider, M. Cordts, T. Rehfeld, D. Pfeiffer, M. Enzweiler, U. Franke, M. Pollefeys, and S. Roth, “Semantic Stixels: Depth is not enough,” in Proc. of IEEE Intelligent Vehicles Symposium (IV). IEEE, 2016, pp. 110–117.
[4] F. Erbs, A. Witte, T. Scharwaechter, R. Mester, and U. Franke, “Spider-based Stixel object segmentation,” in Proc. of IEEE Intelligent Vehicles Symposium Proceedings (IV). IEEE, 2014, pp. 906–911.
[5] D. Pfeiffer and U. Franke, “Modeling dynamic 3D environments by means of the Stixel World,” IEEE Intelligent Transportation Systems Magazine, vol. 3, no. 3, pp. 24–36, 2011.
[6] R. Benenson, M. Mathias, R. Timofte, and L. Van Gool, “Fast Stixel computation for fast pedestrian detection,” in European Conference on Computer Vision. Springer, 2012, pp. 11–20.
[7] U. Franke, D. Pfeiffer, C. Rabe, C. Knoeppel, M. Enzweiler, F. Stein, and R. Herrtwich, “Making Bertha see,” in Proc. of the IEEE International Conference on Computer Vision Workshops (ICCV Workshops), 2013, pp. 214–221.
[8] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
