TL;DR
This paper introduces a novel method for semantic boundary detection that explicitly models annotation noise, leading to sharper boundary predictions and improved performance over existing methods, while also enhancing coarse segmentation labels.
Contribution
It proposes a new layer and loss function that reason about annotation noise and true boundaries, improving boundary detection accuracy in noisy datasets.
Findings
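More than 4% improvement in MF(ODS) over the CASENet backbone
18.61% improvement in AP over the CASENet backbone
Refined coarse segmentation labels (more than 20% and 30% IoU gain for 16px and 32px label error, respectively)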
| Metric | Method | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv | mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MF (ODS) | CASENet | 74.84 | 60.17 | 73.71 | 47.68 | 66.69 | 78.59 | 66.66 | 76.23 | 47.17 | 69.35 | 36.23 | 75.88 | 72.45 | 61.78 | 73.10 | 43.01 | 71.23 | 48.82 | 71.87 | 54.93 | 63.52 |
| MF (ODS) | CASENet-S | 76.26 | 62.88 | 75.77 | 51.66 | 66.73 | 79.78 | 70.32 | 78.90 | 49.72 | 69.55 | 39.84 | 77.25 | 74.29 | 65.39 | 75.35 | 47.85 | 72.03 | 51.39 | 73.13 | 57.35 | 65.77 |
| MF (ODS) | SEAL | 78.41 | 66.32 | 76.83 | 52.18 | 67.52 | 79.93 | 69.71 | 79.37 | 49.45 | 72.52 | 41.38 | 78.12 | 74.57 | 65.98 | 76.47 | 49.98 | 72.78 | 52.10 | 74.05 | 58.16 | 66.79 |
| MF (ODS) | Ours (NMS Loss) | 78.96 | 66.20 | 77.53 | 54.76 | 69.42 | 81.77 | 71.38 | 78.28 | 52.01 | 74.10 | 42.79 | 79.18 | 76.57 | 66.71 | 77.71 | 49.70 | 74.99 | 50.54 | 75.50 | 59.32 | 67.87 |
| MF (ODS) | Ours (NMS Loss + AAlign) | 80.15 | 67.80 | 77.69 | 54.26 | 69.54 | 81.48 | 71.34 | 78.97 | 51.76 | 73.61 | 42.82 | 79.80 | 76.44 | 67.68 | 78.16 | 50.43 | 75.06 | 50.99 | 75.31 | 59.66 | 68.15 |
| AP | CASENet | 50.53 | 44.88 | 41.69 | 28.92 | 42.97 | 54.46 | 47.39 | 58.28 | 35.53 | 45.61 | 25.22 | 56.39 | 48.45 | 42.79 | 55.38 | 27.31 | 48.69 | 39.88 | 45.05 | 34.77 | 43.71 |
| AP | CASENet-S | 67.64 | 53.10 | 69.79 | 40.51 | 62.52 | 73.49 | 63.10 | 75.26 | 39.96 | 60.74 | 30.43 | 72.28 | 65.15 | 56.57 | 70.80 | 33.91 | 61.92 | 45.09 | 67.87 | 48.93 | 57.95 |
| AP | SEAL | 74.24 | 57.45 | 72.72 | 42.52 | 65.39 | 74.50 | 65.52 | 77.93 | 40.92 | 65.76 | 33.36 | 76.31 | 68.85 | 58.31 | 73.76 | 38.87 | 66.31 | 46.93 | 69.40 | 51.40 | 61.02 |
| AP | Ours (NMS Loss) | 75.85 | 59.65 | 74.29 | 43.68 | 65.65 | 77.63 | 67.22 | 76.63 | 42.33 | 70.67 | 31.23 | 77.66 | 74.59 | 61.04 | 77.44 | 38.28 | 69.53 | 40.84 | 71.69 | 50.39 | 62.32 |
| AP | Ours (NMS Loss + AAlign) | 76.74 | 60.94 | 73.92 | 43.13 | 66.48 | 77.09 | 67.80 | 77.50 | 42.09 | 70.05 | 32.11 | 78.42 | 74.77 | 61.28 | 77.52 | 39.02 | 68.51 | 41.46 | 71.62 | 51.04 | 62.57 |
| Method | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv | mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CASENet [36] | 83.3 | 76.0 | 80.7 | 63.4 | 69.2 | 81.3 | 74.9 | 83.2 | 54.3 | 74.8 | 46.4 | 80.3 | 80.2 | 76.6 | 80.8 | 53.3 | 77.2 | 50.1 | 75.9 | 66.8 | 71.4 |
| SEAL [37] | 84.9 | 78.6 | 84.6 | 66.2 | 71.3 | 83.0 | 76.5 | 87.2 | 57.6 | 77.5 | 53.0 | 83.5 | 82.2 | 78.3 | 85.1 | 58.7 | 78.9 | 53.1 | 77.7 | 69.7 | 74.4 |
| Ours | 85.8 | 80.0 | 85.6 | 68.4 | 71.6 | 85.7 | 78.1 | 87.5 | 59.1 | 78.5 | 53.7 | 84.8 | 83.4 | 79.5 | 85.3 | 60.2 | 79.6 | 53.7 | 80.3 | 71.4 | 75.6 |

| Label Quality | 4px error | 8px error | 16px error | 32px error |
|---|---|---|---|---|
| Num. Clicks per Image | 70.34 | 44.76 | 26.78 | 14.64 |
| Test IoU | 91.22 | 78.95 | 62.20 | 41.31 |
| GrabCut | 68.74 | 70.32 | 69.76 | 62.82 |
| Ours (Coarse-to-Fine) IoU | 92.78 | 88.16 | 82.89 | 76.20 |

| Label Quality | 4px error | 8px error | 16px error | 32px error | Real Coarse |
|---|---|---|---|---|---|
| Num. Clicks per Image | 175.23 | 95.63 | 49.21 | 27.00 | 98.78 |
| Test IoU | 74.85 | 53.32 | 33.71 | 19.44 | 48.67 |
| GrabCut | 26.00 | 28.51 | 29.35 | 25.99 | 32.11 |
| Ours (Coarse-to-Fine) IoU | 78.93 | 69.21 | 58.96 | 50.35 | 67.43 |
Devil is in the Edges: Learning Semantic Boundaries from Noisy Annotations
David Acuna (1,2,3), Amlan Kar (2,3), Sanja Fidler (1,2,3)
(1) NVIDIA, (2) University of Toronto, (3) Vector Institute
{davidj, amlan}@cs.toronto.edu, [email protected]
Abstract
We tackle the problem of semantic boundary prediction, which aims to identify pixels that belong to object (class) boundaries. We notice that relevant datasets contain a significant level of label noise, reflecting the fact that precise annotations are laborious to get and thus annotators trade off quality for efficiency. We aim to learn sharp and precise semantic boundaries by explicitly reasoning about annotation noise during training. We propose a simple new layer and loss that can be used with existing learning-based boundary detectors. Our layer/loss forces the detector to predict a maximum response along the normal direction at an edge, while also regularizing its direction. We further reason about true object boundaries during training using a level set formulation, which allows the network to learn from misaligned labels in an end-to-end fashion. Experiments show that we improve over the CASENet [36] backbone network by more than 4% in terms of MF(ODS) and 18.61% in terms of AP, outperforming all current state-of-the-art methods including those that deal with alignment. Furthermore, we show that our learned network can be used to significantly improve coarse segmentation labels, making it an efficient way to label new data.
Project Page: https://nv-tlabs.github.io/STEAL/
1 Introduction
Image boundaries are an important cue for recognition [26, 14, 2]. Humans can recognize objects from sketches alone, even in cases where a significant portion of the boundary is missing [5, 19]. Boundaries have also been shown to be useful for 3D reconstruction [23, 21, 38], localization [35, 31], and image generation [18, 32].
In the task of semantic boundary detection, the goal is to move away from low-level image edges to identifying image pixels that belong to object (class) boundaries. It can be seen as a dual task to image segmentation, which identifies object regions. Intuitively, predicting semantic boundaries is an easier learning task since they are mostly rooted in identifiable higher-frequency image locations, while region pixels may often be homogeneous in color, leading to ambiguities for recognition. On the other hand, the performance metrics are harder: while getting the coarse regions right may lead to an artificially high Jaccard index [20], boundary-related metrics focus their evaluation tightly along the object edges. Getting these correct is very important for tasks such as object instance segmentation, robot manipulation and grasping, or image editing.
However, annotating precise object boundaries is extremely slow, taking as much as 30-60s per object [1, 8]. Thus most existing datasets contain significant label noise (Fig. 1, bottom left), trading quality for labeling efficiency. This may be the root cause of why most learned detectors output thick boundary predictions, which are undesirable for downstream tasks.
In this paper, we aim to learn sharp and precise semantic boundaries by explicitly reasoning about annotation noise during training. We propose a new layer and loss that can be added on top of any end-to-end edge detector. It forces the edge detector to predict a maximum response along the normal direction at an edge, while also regularizing its direction. By doing so, we alleviate the problem of predicting overly thick boundaries and directly optimize for Non-Maximally-Suppressed (NMS) edges. We further reason about true object boundaries using a level-set formulation, which allows the network to learn from misaligned labels in an end-to-end fashion.
Experiments show that our approach improves the performance of a backbone network, i.e. CASENet [36], by more than 4% in terms of MF(ODS) and 18.61% in terms of AP, outperforming all current state-of-the-art methods. We further show that our predicted boundaries are significantly better than those obtained from the latest DeepLab-v3 [9] segmentation outputs, while using a much more lightweight architecture. Our learned network is also able to improve coarsely annotated segmentation masks with 16px and 32px error, improving their accuracy by more than 20% IoU and 30% IoU, respectively. This makes our method an efficient means to collect new labeled data: annotators coarsely outline objects with just a few clicks, and our approach generates finer ground-truth. We showcase this idea by refining the Cityscapes-coarse labelset, and exploiting these labels to train a state-of-the-art segmentation network [9]. We observe a significant improvement of more than 1.2% in some of the refined categories.
2 Related Work
Semantic Boundary Detection. Learning-based semantic edge detection dates back to [28] which learned a classifier that operates on top of a standard edge detector. In [15], the authors introduced the Semantic Boundaries Dataset (SBD) and formally studied the problem of semantic contour detection in real world images. They proposed the idea of an inverse detector which combined bottom-up edges and top-down detection. More recently, [36] extended the CNN-based class-agnostic edge detector proposed in [34], and allowed each edge pixel to be associated with more than one class. The proposed CASENet architecture combined low and high-level features with a multi-label loss function to supervise the fused activations.
Most works use non-maximum-suppression [6] as a postprocessing step in order to deal with the thickness of predicted boundaries. In our work, we directly optimize for NMS during training. We further reason about misaligned ground-truth annotations with real object boundaries, which is typically not done in prior work. Note that our focus here is not to propose a novel edge-detection approach, but rather to have a simple add-on to existing architectures.
The work most closely related to ours is SEAL [37], in that it deals with misaligned labels during training. Similar to us, SEAL treats the underlying ground truth boundaries as a latent variable that is jointly optimized during training. Optimization is formulated as a computationally expensive bipartite graph min-cost assignment problem. In order to make optimization tractable, there are no pair-wise costs, i.e. two neighboring ground-truth pixels can be matched to two pixels far apart in the latent ground-truth, potentially leading to ambiguities in training. In our work, we infer true object boundaries via a level set formulation which preserves connectivity and proximity, and ensures that the inferred ground-truth boundaries are well behaved. Moreover, SEAL is limited to the domain of boundary detection and needs to have reasonably well annotated data, since alignment is defined as a one-to-one mapping between annotated and inferred ground-truth. In our method, substantial differences (topology and deviation) in ground truth can be handled. Our approach can thus be naturally used to refine coarse segmentation labels, lending itself as a novel way to efficiently annotate datasets.
Level Set Segmentation. Level Set Methods [27] have been widely used for image segmentation [7, 12, 30, 17, 24, 4, 22, 13] due to their ability to automatically handle various topological changes such as splitting and merging. Most older work derived different level set formulations on top of standard image gradient observations, while recent work swapped those with neural network outputs [17]. In [24], the authors proposed a deep structured active contours method that learns the parameters of an active contour model using a CNN. [20] introduced a method for object proposal generation, by learning to efficiently place seeds such that critical level sets originating from these seeds hit object boundaries. In parallel work, [33] learns CNN feature extraction and levelset evolution in an end-to-end fashion for object instance annotation. In our work, we exploit level set optimization during training as a means to iteratively refine ground-truth semantic boundaries.
3 The STEAL Approach
In this section, we introduce our Semantically Thinned Edge Alignment Learning (STEAL) approach. Our method consists of a new boundary thinning layer together with a loss function that aims to produce thin and precise semantic edges. We also propose a framework that jointly learns object edges while learning to align noisy human-annotated edges with the true boundaries during training. We refer to the latter as active alignment. Intuitively, by using the true boundary signal to train the boundary network, we expect it to learn and produce more accurate predictions. STEAL is agnostic to the backbone CNN architecture, and can be plugged on top of any existing learning-based boundary detection network. We illustrate the framework in Fig. 2.
Subsec. 3.1 gives an overview of semantic boundary detection and the relevant notation. Our boundary thinning layer and loss are introduced in Subsec. 3.3. In Subsec. 3.4, we describe our active alignment framework.
3.1 Semantic Aware Edge-Detection
Semantic Aware Edge-Detection [36, 37] can be defined as the task of predicting $K$ boundary maps for $K$ object classes given an input image $x$. Let $y_m^k \in \{0, 1\}$ indicate whether pixel $m$ belongs to class $k$. We aim to compute the probability map $P(y^k \mid x; \theta)$, which is typically assumed to decompose into a set of pixel-wise probabilities $P(y_m^k \mid x; \theta)$ modeled by Bernoulli distributions. It is computed with a convolutional neural network $f$ with $K$ sigmoid outputs and parameters $\theta$. Each pixel is thus allowed to belong to multiple classes, which handles the case of multi-class occlusion boundaries. Note that the standard class-agnostic edge detection can be seen as a special case with $K = 1$ (consuming all foreground classes).
Semantic Edge Learning.
State-of-the-art boundary detectors are typically trained using the standard binary cross entropy loss adopted from HED [34]. To deal with the high imbalance between the edge and non-edge pixels, a weighting term $\beta = |Y^-|/|Y|$ is often used, where $|Y^-|$ accounts for the number of non-edge pixels among all classes in the mini-batch, and $|Y|$ is the total number of pixels. In the multi-class scenario, the classes are assumed to be independent [36, 37]. Therefore, in learning the following weighted binary cross-entropy loss is minimized:
$$\mathcal{L}_{BCE}(\theta) = -\sum_k \sum_m \left\{ \beta\, y_m^k \log f_k(m \mid x, \theta) + (1 - \beta)(1 - y_m^k) \log\left(1 - f_k(m \mid x, \theta)\right) \right\} \quad (1)$$
where $y$ indicates the ground-truth boundary labels.
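As a minimal sketch of the class-balanced loss described above, the following pure-Python function evaluates the weighted binary cross-entropy for a single flattened class map. The function name `weighted_bce`, the clamping constant `eps`, and the choice to compute the weight $\beta$ per map (rather than over all classes in a mini-batch) are illustrative assumptions, not details taken from the paper.

```python
import math

def weighted_bce(probs, labels, eps=1e-7):
    """Class-balanced binary cross-entropy for one flattened class map.

    probs  -- sigmoid outputs in (0, 1), one per pixel
    labels -- 0/1 ground-truth boundary labels, one per pixel
    """
    n = len(labels)
    # beta is the fraction of non-edge pixels, so rare edge pixels get the larger weight.
    # Assumption: beta is computed per map here, not over the whole mini-batch.
    beta = sum(1 for y in labels if y == 0) / n
    loss = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep the logarithms finite
        loss -= beta * y * math.log(p) + (1.0 - beta) * (1 - y) * math.log(1.0 - p)
    return loss
```

With one edge pixel among four, the single positive term carries weight 0.75 while each negative term carries weight 0.25, so both groups contribute equally when the predictions are uninformative.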
3.2 Semantic Boundary Thinning Layer
In the standard formulation, nearby pixels in each boundary map are considered to be independent, which can cause the predictions to “fire” densely around object boundaries. We aim to encourage predictions along each boundary pixel’s normal to give the maximal response on the actual boundary. This is inspired by edge-based non-maximum suppression (NMS) dating back to Canny’s work [6]. Furthermore, we add an additional loss term that encourages the normals estimated from the predicted boundary maps to agree with the normals computed from ground-truth edges. The two losses work together in producing sharper predictions along both the normal and tangent directions.
3.3 Thinning Layer and NMS Loss
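Formally, during training we add a new deterministic layer on top of the boundary prediction map. For each positive ground-truth boundary pixel $p$ for class $k$ we normalize the responses along the normal direction $\vec{d}_p$ as follows:
$$h_k(p \mid x, \theta) = \frac{\exp\left(f_k(p \mid x, \theta)/\tau\right)}{\sum_{t=-L}^{L} \exp\left(f_k(p_t \mid x, \theta)/\tau\right)} \quad (2)$$
where the coordinates of the sampled locations $p_t$ are:
$$x(p_t) = x(p) + t \cos \vec{d}_p \quad (3)$$
$$y(p_t) = y(p) + t \sin \vec{d}_p \quad (4)$$
Here, $t \in \{-L, -L+1, \dots, L\}$, and $L$ denotes the maximum distance of a pixel $p_t$ from $p$ along the normal $\vec{d}_p$. See Fig. 2 for a visualization. We compute the normal direction $\vec{d}_p$ from the ground-truth boundary map using basic trigonometry and a fixed convolutional layer that estimates second derivatives. The parameter $\tau$ in Eq. (2) denotes the temperature of the softmax. Both $L$ and $\tau$ are fixed hyper-parameters.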
Intuitively, we want to encourage the true boundary pixel $p$ to achieve the highest response along its normal direction. We do this via an additional loss, referred to as the NMS loss, that pushes the predicted categorical distribution computed with $h$ towards a Dirac delta target distribution:
$$\mathcal{L}_{nms}(\theta) = -\sum_k \sum_p \log h_k(p \mid x, \theta) \quad (5)$$
Note that $p$ indexes only the positive boundary pixels for each class; other pixels do not incur the NMS loss. We compute $f_k(p_t \mid x, \theta)$ in Eq. (2) for non-integral locations $p_t$ using a bilinear kernel.
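The thinning layer and NMS loss can be sketched in a few lines of pure Python: sample the prediction map at $2L+1$ locations along the normal with bilinear interpolation, apply a temperature softmax, and penalise the negative log of the centre response. The function names, the representation of the normal as an angle `theta`, the border clamping, and the default values `L=2` and `tau=0.1` are assumptions made for illustration only.

```python
import math

def bilinear(img, x, y):
    """Sample img (a list of rows) at a non-integral (x, y); coordinates are clamped."""
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * img[y0][x0] + ax * img[y0][x1]
    bottom = (1 - ax) * img[y1][x0] + ax * img[y1][x1]
    return (1 - ay) * top + ay * bottom

def nms_response(f, px, py, theta, L=2, tau=0.1):
    """Softmax-normalised response at (px, py) along the normal with angle theta."""
    logits = [bilinear(f, px + t * math.cos(theta), py + t * math.sin(theta)) / tau
              for t in range(-L, L + 1)]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    return exps[L] / sum(exps)  # index L corresponds to t = 0

def nms_loss(f, boundary_pixels, L=2, tau=0.1):
    """boundary_pixels: (px, py, theta) triples for positive ground-truth pixels."""
    return -sum(math.log(nms_response(f, px, py, th, L, tau))
                for px, py, th in boundary_pixels)
```

A one-pixel-wide response along the normal yields a centre probability close to 1 and a loss close to 0, whereas a three-pixel-wide response spreads the probability over three samples and is penalised.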
Direction Loss.
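Ideally, the predicted boundaries would have normal directions similar to those computed from the ground-truth boundaries. We follow [3] to define the error as the mean squared loss function in the angular domain:
$$\mathcal{L}_{dir}(\theta) = \sum_k \sum_p \left\| \cos^{-1} \langle \vec{d}_p, \vec{e}_p(\theta) \rangle \right\|^2 \quad (6)$$
with $\vec{d}_p$ the ground-truth normal direction in boundary pixel $p$, and $\vec{e}_p$ the normal computed from the predicted boundary map. We use the same convolutional layer on top of $f_k$ to get $\vec{e}_p$. Finally, we compute our full augmented loss as the combination of the following three terms:
$$\mathcal{L} = \alpha_1 \mathcal{L}_{BCE} + \alpha_2 \mathcal{L}_{nms} + \alpha_3 \mathcal{L}_{dir} \quad (7)$$
where $\alpha_1, \alpha_2, \alpha_3$ are hyper-parameters that control the importance of each term (see Experiments).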
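To make the angular error concrete, the sketch below computes a squared angular distance between ground-truth and predicted normals and combines three loss values with weights. It assumes normals are given as angles of oriented unit vectors, that the per-pixel errors are summed rather than averaged, and that the weights `alphas` are placeholders; none of these choices are confirmed by the paper.

```python
import math

def direction_loss(gt_angles, pred_angles):
    """Sum of squared angles between ground-truth and predicted unit normals."""
    total = 0.0
    for a, b in zip(gt_angles, pred_angles):
        # Inner product of the two unit vectors (cos a, sin a) and (cos b, sin b).
        dot = math.cos(a) * math.cos(b) + math.sin(a) * math.sin(b)
        dot = min(1.0, max(-1.0, dot))  # guard acos against rounding error
        total += math.acos(dot) ** 2
    return total

def full_loss(l_bce, l_nms, l_dir, alphas=(1.0, 1.0, 1.0)):
    """Weighted combination of the three terms; the weights are hypothetical."""
    a1, a2, a3 = alphas
    return a1 * l_bce + a2 * l_nms + a3 * l_dir
```

Identical normals incur zero loss, and perpendicular normals incur $(\pi/2)^2$ per pixel.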
3.4 Active Alignment
Learning good boundary detectors requires high quality annotated data. However, accurate boundaries are time consuming to annotate. Thus datasets trade off quality against annotation efficiency. Like [37], we notice that the standard SBD benchmark [15] contains significant label noise. In this section, we propose a framework that allows us to jointly reason about true semantic boundaries and train a network to predict them. We adopt a level set formulation which ensures that the inferred “true” boundaries remain connected, and are generally well behaved.
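Let $\hat{y} = \{\hat{y}^1, \dots, \hat{y}^K\}$ denote a more accurate version of the ground-truth label $y = \{y^1, \dots, y^K\}$, which we aim to infer as part of our training procedure. We define $\hat{y}^k$ as a curve. Our goal is to jointly optimize for the latent variable $\hat{y}$ and parameters $\theta$ of the boundary detection network. The optimization is defined as a minimization of the following loss:
$$\min_{\hat{y}, \theta} \mathcal{L}(\hat{y}, \theta) = -\sum_k \log P(y^k, \hat{y}^k \mid x; \theta) = -\sum_k \left( \log P(y^k \mid \hat{y}^k) + \log P(\hat{y}^k \mid x; \theta) \right) \quad (8)$$
The second term is the log-likelihood of the model and can be defined as in the previous section. The first term encodes the prior that encourages $\hat{y}^k$ to be close to $y^k$. Inspired by [7], we define this term with an energy that describes “well behaved” curves:
$$E(y^k \mid \hat{y}^k, \lambda) = \int_q g_k\left(\hat{y}^k(q)\right) \left| \hat{y}^{k\prime}(q) \right| dq \quad (9)$$
where we define $g_k$ as the following decreasing function:
$$g_k = \frac{1}{\sqrt{1 + f_k}} + \frac{\lambda}{\sqrt{1 + y^k}} \quad (10)$$
Here, $\lambda$ is a hyper-parameter that controls the effect of $y^k$. Intuitively, this energy is minimized when the curve $\hat{y}^k$ lies in areas of high probability mass of $f_k$, and is close, by a factor of $\lambda$, to the given ground-truth $y^k$.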
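The role of the decreasing function and of the curve energy can be illustrated with a small discretisation. The sketch assumes the form $g = 1/\sqrt{1+f} + \lambda/\sqrt{1+y}$ for the decreasing function, represents the curve as a closed polyline with integer (row, column) vertices, and approximates the integral with the trapezoidal rule; the function names and the default `lam=1.0` are hypothetical.

```python
import math

def edge_stopping(f, y, lam=1.0):
    """Per-pixel g = 1/sqrt(1 + f) + lam/sqrt(1 + y) on 2D lists with values in [0, 1]."""
    return [[1.0 / math.sqrt(1.0 + fv) + lam / math.sqrt(1.0 + yv)
             for fv, yv in zip(f_row, y_row)]
            for f_row, y_row in zip(f, y)]

def curve_energy(curve, g):
    """Discretised energy of a closed polyline given as a list of (row, col) vertices."""
    total = 0.0
    for a, b in zip(curve, curve[1:] + curve[:1]):
        length = math.hypot(b[0] - a[0], b[1] - a[1])
        # Trapezoidal rule: average g at the two end points of each segment.
        total += 0.5 * (g[a[0]][a[1]] + g[b[0]][b[1]]) * length
    return total
```

Because `g` shrinks where the network response is high, a curve of fixed length has lower energy when it passes through confidently predicted boundary pixels.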
We can minimize Eq. (9) via steepest descent, where we find the gradient descent direction that allows us to deform the initial curve $\hat{y}_0^k$ (here we use the given noisy ground-truth as the initial curve) towards a (local) minimum of Eq. (9) [7]:
$$\frac{\partial \hat{y}_t^k}{\partial t} = \kappa\, g_k\, \vec{n} - (\nabla g_k \cdot \vec{n})\, \vec{n} \quad (11)$$
Here $\kappa$ is the Euclidean curvature and $\vec{n}$ is the inward normal to the boundary. Details of this computation can be found in [7], Appendix B and C.
Eq. (11) follows the level-set approach [27], where the curve $\hat{y}_t^k$ is the zero level-set of an embedding function $\phi$, i.e. a set of points satisfying $\phi(\cdot) = 0$. By differentiating the latter equation, it is easy to show that if $\hat{y}_t^k$ evolves according to $\partial \hat{y}_t^k / \partial t = F \vec{n}$ for a scalar speed $F$, then the embedding function can be deformed as $\partial \phi / \partial t = F |\nabla \phi|$ [7]. We can thus rewrite the evolution of $\hat{y}_t^k$ in terms of $\phi$ as follows:
$$\frac{\partial \phi}{\partial t} = g_k (\kappa + c)\, |\nabla \phi| + \nabla g_k \cdot \nabla \phi \quad (12)$$
where $c$ can be seen as a constant velocity that helps to avoid certain local minima [7]. Eq. (12) can also be interpreted as the Geodesic Active Contour formulation of the Level Set Method [7, 27].
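A single explicit Euler step of a geodesic active contour update of the form $\partial\phi/\partial t = g(\kappa + c)|\nabla\phi| + \nabla g \cdot \nabla\phi$ can be sketched on a small grid as follows. The sketch assumes unit grid spacing, central differences, replicated borders, and an embedding function that is negative inside the object; the step size `dt` and the helper `inside_area` are illustrative, and production level set code would normally use upwind schemes and periodic re-initialisation.

```python
import math

def evolve_step(phi, g, c=0.0, dt=0.1):
    """One explicit step of d(phi)/dt = g (kappa + c) |grad phi| + <grad g, grad phi>."""
    h, w = len(phi), len(phi[0])

    def at(a, i, j):
        # Clamp indices, i.e. replicate the border values.
        return a[min(max(i, 0), h - 1)][min(max(j, 0), w - 1)]

    out = [row[:] for row in phi]
    for i in range(h):
        for j in range(w):
            px = (at(phi, i, j + 1) - at(phi, i, j - 1)) / 2.0
            py = (at(phi, i + 1, j) - at(phi, i - 1, j)) / 2.0
            pxx = at(phi, i, j + 1) - 2.0 * phi[i][j] + at(phi, i, j - 1)
            pyy = at(phi, i + 1, j) - 2.0 * phi[i][j] + at(phi, i - 1, j)
            pxy = (at(phi, i + 1, j + 1) - at(phi, i + 1, j - 1)
                   - at(phi, i - 1, j + 1) + at(phi, i - 1, j - 1)) / 4.0
            gx = (at(g, i, j + 1) - at(g, i, j - 1)) / 2.0
            gy = (at(g, i + 1, j) - at(g, i - 1, j)) / 2.0
            norm2 = px * px + py * py
            norm = math.sqrt(norm2)
            kappa = 0.0
            if norm2 > 1e-12:
                # Curvature of the level sets: div(grad phi / |grad phi|).
                kappa = (pxx * py * py - 2.0 * px * py * pxy + pyy * px * px) / (norm2 * norm)
            out[i][j] = phi[i][j] + dt * (g[i][j] * (kappa + c) * norm + gx * px + gy * py)
    return out

def inside_area(phi):
    """Number of pixels in the segmented region, assuming phi < 0 inside."""
    return sum(1 for row in phi for v in row if v < 0)
```

With a constant `g` and `c = 0` the update reduces to curvature flow, so a circle shrinks; where `g` is zero the embedding function does not change, which is what stops the contour at confidently predicted edges.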
3.5 Learning
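Minimizing Eq. (8) can be performed with an iterative two step optimization process. In one step, we evolve the provided boundary $y^k$ towards areas where the network is highly confident. The number of evolution steps, indexed by $t$, can be treated as a latent variable, and $\hat{y}^k$ is selected by choosing the $\hat{y}_t^k$ that minimizes Eq. (8). In the second step, we optimize $\theta$ using the computed $\hat{y}^k$.
Formally, we want to solve:
$$\min_{\hat{y}, \theta} \mathcal{L}(\hat{y}, \theta) = \min_{\hat{y}, \theta} -\sum_k \left( \log P(y^k \mid \hat{y}^k) + \log P(\hat{y}^k \mid x; \theta) \right) \quad (13)$$
where we iterate between holding $\theta$ fixed and optimizing $\hat{y}$:
$$\min_{\hat{y}} \mathcal{L}(\hat{y}, \theta) \approx \min_t \left\{ -\sum_k \log P(y^k \mid \hat{y}_t^k) - C \right\} \quad (14)$$
and optimizing $\theta$ via Eq. (7) while holding $\hat{y}$ fixed. Here $C$ is a constant that does not affect optimization.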
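The two-step procedure can be summarised as a short alternation skeleton. Everything here is a hypothetical interface: `evolve` stands for one level set evolution step, `loss` for the objective used to pick the best evolution step, and `update_params` for one round of network training on the refined labels. The sketch also assumes that each round restarts the evolution from the provided noisy labels.

```python
def alternate(noisy_labels, evolve, loss, update_params, rounds=3, max_steps=5):
    """Alternate between refining labels (parameters fixed) and training (labels fixed)."""
    refined = noisy_labels
    for _ in range(rounds):
        # Step 1: parameters fixed. Evolve the provided labels and keep the
        # evolution step t whose result has the lowest loss.
        candidates = [noisy_labels]
        for _ in range(max_steps):
            candidates.append(evolve(candidates[-1]))
        refined = min(candidates, key=loss)
        # Step 2: labels fixed. Update the network parameters on the refined labels.
        update_params(refined)
    return refined
```

In a toy setting where labels are integers, `evolve` adds one and the loss is the squared distance to 3, the loop selects the third evolution step in every round.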
3.6 Coarse-to-Fine Annotation
Embedding the evolution of $\hat{y}$ in that of $\phi$ has two main benefits. Firstly, topological changes of $\hat{y}$ are handled for free, and accuracy and stability can be achieved by using proper numerical methods. Secondly, $\phi$ can be naturally interpreted as a mask segmenting an object, where $\phi < 0$ corresponds to the segmented region. Moreover, our approach can also be easily used to speed up object annotation. Assume a scenario where an annotator draws a coarse mask inside an object of interest, using only a few clicks. This is how the coarse subset of the Cityscapes dataset has been annotated [11]. We can use our learned model and level set formulation (Eq. (12)), with fixed settings of $\lambda$ and $c$, to evolve the given coarse mask for $t$ iterations and produce an improved segmentation mask whose edges align with the edges predicted by our model.
References
- [1] D. Acuna, H. Ling, A. Kar, and S. Fidler. Efficient annotation of segmentation datasets with Polygon-RNN++. In CVPR, 2018.
- [2] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. T-PAMI, 33(5):898–916, May 2011.
- [3] M. Bai and R. Urtasun. Deep watershed transform for instance segmentation. In CVPR, 2017.
- [4] M. Bergtholdt, D. Cremers, and C. Schnörr. Variational segmentation with shape priors. In O. F. N. Paragios, Y. Chen, editor, Handbook of Mathematical Models in Computer Vision. Springer, 2005.
- [5] I. Biederman. Recognition-by-components: A theory of human image understanding. Psychological Review, 94:115–147, 1987.
- [6] J. Canny. A computational approach to edge detection. T-PAMI, 8(6):679–698, June 1986.
- [7] V. Caselles, R. Kimmel, and G. Sapiro. Geodesic active contours. IJCV, 22(1):61–79, 1997.
- [8] L.-C. Chen, S. Fidler, A. Yuille, and R. Urtasun. Beat the MTurkers: Automatic image labeling from weak 3D supervision. In CVPR, 2014.
