TL;DR
RefineContourNet is a ResNet-based multi-path CNN that effectively fuses multi-level features for superior object contour and edge detection, achieving state-of-the-art results on standard datasets.
Contribution
Introduces a novel multi-path refinement CNN that fuses high, mid, and low-level features in a specific order for improved contour detection.
Findings
Achieves an ODS of 0.752 on PASCAL dataset for contour detection.
Reaches an ODS of 0.824 on BSDS500 for edge detection.
Outperforms previous methods in both contour and edge detection tasks.
Abstract
A ResNet-based multi-path refinement CNN is used for object contour detection. For this task, we prioritise the effective utilization of the high-level abstraction capability of a ResNet, which leads to state-of-the-art results for edge detection. Keeping our focus in mind, we fuse the high, mid and low-level features in that specific order, which differs from many other approaches. It uses the tensor with the highest-levelled features as the starting point to combine it layer-by-layer with features of a lower abstraction level until it reaches the lowest level. We train this network on a modified PASCAL VOC 2012 dataset for object contour detection and evaluate on a refined PASCAL-val dataset reaching an excellent performance and an Optimal Dataset Scale (ODS) of 0.752. Furthermore, by fine-training on the BSDS500 dataset we reach state-of-the-art results for edge-detection with an ODS…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
MethodsAverage Pooling · Global Average Pooling · 1x1 Convolution · *Communicated@Fast*How Do I Communicate to Expedia? · Batch Normalization · Bottleneck Residual Block · Max Pooling · Kaiming Initialization · Residual Connection · Convolution
11institutetext: Helmut Schmidt University, Department of Signal Processing and Communications, Holstenhofweg 85, 22043 Hamburg, Germany
11email: {andre.kelm, udo.zoelzer}@hsu-hamburg.de
22institutetext: Hamburg University of Technology, Department of Mechatronics,
Am Schwarzenberg-Campus 1, 21073 Hamburg, Germany
22email: [email protected]
Object Contour and Edge Detection with RefineContourNet
André Peter Kelm 11 0000-0003-4146-7953
Vijesh Soorya Rao 22 0000-0003-4746-2694
Udo Zölzer 11
Abstract
A ResNet-based multi-path refinement CNN is used for object contour detection. For this task, we prioritise the effective utilization of the high-level abstraction capability of a ResNet, which leads to state-of-the-art results for edge detection. Keeping our focus in mind, we fuse the high, mid and low-level features in that specific order, which differs from many other approaches. It uses the tensor with the highest-levelled features as the starting point to combine it layer-by-layer with features of a lower abstraction level until it reaches the lowest level. We train this network on a modified PASCAL VOC 2012 dataset for object contour detection and evaluate on a refined PASCAL-val dataset reaching an excellent performance and an Optimal Dataset Scale (ODS) of 0.752. Furthermore, by fine-training on the BSDS500 dataset we reach state-of-the-art results for edge-detection with an ODS of 0.824.
Keywords:
Object Contour Detection Edge Detection Multi-Path Refinement CNN.
1 Introduction
Object contour detection extracts information about the object shape in images. Reliable detectors distinguish between desired object contours and edges from the background. Resulting object contour maps are very useful for supporting and/or improving various computer vision applications, like semantic segmentation [5, 31, 35], object proposal [24] and object flow estimation [25, 31].
Holistically-Nested Edge Detection (HED) [32] has shown that it is beneficial to use features of a pre-trained classification network to capture desired image boundaries and suppressing undesired edges. Khoreva et al. [13] have specifically trained the HED on object contour detection and proven the potential of HED for this task. Yang et al. have used a Fully Convolutional Encoder-Decoder Network (CEDN) to produce contour maps, in which the object contours of certain object classes are highlighted and other edges are suppressed more effectively than before [34]. Convolutional Oriented Boundaries (COB) [21] outperforms these results by using multi-scale oriented contours derived from a HED-like network architecture together with an efficient hierarchical image segmentation algorithm. A common feature in all this work is that a Very Deep Convolutional Network for Large-Scale Image Recognition (VGG) [28] and its classifying ability is used as a backbone network. It is obvious that this backbone and its effective use are the major keys to the results achieved, but the new methods mentioned here do not use the latest classification networks, like Deep Residual Learning for Image Recognition (ResNet) [12], which show a higher classification ability than VGG. We use a ResNet as backbone and propose a strategy to prioritise the effective utilization of the high-level abstraction capability for object contour detection. Accordingly we choose a fitting architecture and a customized training procedure. We outperform the methods mentioned previously and achieve a very robust detector with an excellent performance on the validation data of a refined PASCAL VOC [10]. High-level edge detection is closely related to object contour detection, because object contours are often an important subset of the desired detection. Continuing, we will introduce the edge detection task and show that, unlike object contour detection, there is unexploited potential for using the high abstraction capability of classification networks.
Edge detection has a rich history and – as one of the classic vision problems – plays a role in almost any vision pipeline with applications like optical flow estimation [17], image segmentation [7] or generative image inpainting [22, 33]. Classical low-level detectors, such as Canny [6] and Sobel [29], or the recently applied edge-detection with the Phase Stretch Transform [2], filter the entire image and do not distinguish between semantic edges and the rest. Edge detection is no longer limited only to low-level computer vision problems. Even the evaluation method established to date – the Berkeley Segmentation Dataset and the Benchmark 500 (BSDS500) [1] – requires high-level image processing algorithms for good results. Before Convolutional Neural Networks (CNNs) became popular, algorithms like the gPb [1], which uses contour detection together with hierarchical image segmentation, reached impressive results. In recent years, edge detectors, such as DeepNet [14] and N4-Fields [11] have begun to use operations from CNNs to reach a higher-level detection. DeepEdge [3] and DeepContour [27] are CNN applications that use more high-level features to extract contours, and show that this capability improves the detection of certain edges. HED uses higher abstraction abilities than previous methods by combining multi-scale features and multi-level features extracted of a pre-learned CNN, and improves edge detection. Latest edge detectors such as the Crisp Edge Detector (CED) [31], Richer Convolutional Features (RCF) [20], COB and a High-for-Low features algorithm (HFL) [4] make use of their backbone classification nets for edge detection in different ways. But some of these networks are based on older backbone CNNs like the VGG and/or use simple HED-like skip-layer architectures. Similarly for object contour detection, we assume recent work has unexploited potential in the utilization of pre-trained classification abilities, in terms of architecture, backbone network, training procedure and datasets. We contribute a simple but decisive strategy, a network architecture choice following our strategy and unconventional training methods to reach state-of-the-art.
Section 2 will briefly summarize the closest related work, Section 3 contains main contributions, like concept, realization and special training procedures for the proposed detector, Section 4 compares the method with other relevant methods and Section 5 concludes the paper.
2 Related Work
AlexNet [16] was a breakthrough for image classification and was extended to solve other computer vision tasks, such as image segmentation, object contour, and edge detection. The step from image classification to image segmentation with the Fully Convolutional Network (FCN) [26] has favored new edge detection algorithms such as HED, as it allows a pixel-wise classification of an image. HED has successfully used the logistic loss function for the edge or non-edge binary classification. Our approach uses the same loss function, but differs in term of another weighting factor, network architecture, and backbone network. Another image segmentation network, Learning Deconvolution Network for Semantic Segmentation [23], has favored the development of the CEDN, demonstrating the strong relationship between image segmentation, object contour detection and edge detection. The good results of the CEDN inspired us to consider recent image segmentation networks for our task. Yang et al. created a new contour dataset using a Conditional-Random-Fields (CRF) [15] refining method. CEDN and edge detector networks such as COB and HFL have an older backbone net and are outperformed by RCF, which is based on a ResNet and improved the edge detection. RCF has the same backbone network, but differs from our approach in using a different network architecture because it uses a skip-layer structure for feature concatenation like HED. We state that this simple concatenation is not effective enough for edge detection, and we propose to use a more advanced network structure. We have the hypothesis that an effective network architecture for edge detection should prioritise the high abstraction capability itself. As the deepest feature maps are the next ones to the classification layer, we propose to use them as the starting point to refine them layer-by-layer with features of a lower level until it reaches the level of classical edge detection algorithms. Our required properties are combined in RefineNet [18] and that is why we have used the publicly available code from Guosheng Lin et al. as the basis of our approach. Parallelly to the implementation of our method, the CED from the work Learning to Predict Crisp Boundaries from Deng et al. [8] has used a similar bottom-up architecture and surpasses RCF and achieves state-of-the-art. The work from Wang et al. Deep Crisp Boundaries [31] further develops this method and improved state-of-the-art results. Our approach mainly differs from theirs in its conceptualization. They also focus on producing ”crisp” (thinned) boundaries, as they have shown that this benefits their results. We assume in contrast, that by focusing on the effective utilization of the high abstraction capability of a backbone network, we could achieve better results.
3 RefineContourNet
3.1 Concept
Detecting edges with classical low-level methods can visualize the high amount of edges in many images. To distinguish between meaningful edges and undesired edges, a semantic context is required. Our selected contexts are the object contours of the 20 classes of the PASCAL VOC dataset. If context is clear and some low-level vision functions are available, the most important ability for an object contour detector is the high-level abstraction capability, so that edges can be distinguished in the sense of context. For this reason, our concept focuses on the effective use of the high-level abstraction ability of a modern classification network for object contour detection. With this strategy, we choose the architecture, backbone network, training procedure and datasets.
We hypothesize that an effective edge detection network architecture should prioritize the above mentioned capability. For this we propose to give preference to the deepest feature maps of the backbone network and to use them as the starting point in a refinement architecture. To connect the high-level classification ability with the pixel-wise detection stage, we assume, that a step-by-step refinement, where deep features are fused with features of the next shallower level until the shallowest level is reached, should be more effective than skip-layers with a simple feature concatination architecture. In most classification networks, features of different abstraction levels have different resolutions. To merge these features, a multi-resolution fusion is necessary. The RefineNet from Lin et al. [18] provides the desired multi-path refinement and we base our application upon that and name our application in reference to this network RefineContourNet (RCN).
The training procedure has to accomplish two main goals: To effectively use the pre-trained features to form a specific abstraction capability for identifying desired object contours learned from data on the one hand and to connect this to the pixel-wise detection stage on the other hand. Because training data of object contours is limited, both training goals can be enhanced with data augmentation methods. For a similar reason, we do some experiments with a modified Microsoft Common Objects in Context (COCO) [19] dataset, to create an additional object contour dataset, usable for a pre-training. For fine-training on edge-detection, we offer a simple and unconventional training method that considers the individuality of BSDS500’s hand-drawn labels.
3.2 Image Segmentation Network for Contour Detection
The main difference between an image segmentation network and a contour detection network lies in the definition of the objective function. Instead of defining a multi-label segmentation, an object contour can be defined binary. We use the logistic regression loss function
[TABLE]
with , and , where is the prediction for a pixel with the corresponding binary label . symbolizes the learned parameters and is a weighting factor for enhancing the contour detection due to the large imbalance between the contour and the non-contour pixels. Changing the loss function results in a change of the last layer of the RefineNet according to the binary loss function. The 21 feature map layers previously used to segment 20 PASCAL-VOC classes, including background class, will be replaced by a single feature map sufficient for binary classification of contours.
3.3 Network Architecture
Figure 1 shows the RefineContourNet. For clarity, the connections between the blocks specifies the resolution of the feature maps and the size of the feature channel dimension. The Residual Blocks (RB) are part of the ResNet-101. RefineNet has introduced three different refinement path blocks: The Residual Convolution Unit (RCU), the Multi-Resolution Fusion (MRF) and the Chained Residual Pooling (CRP). They are arranged in a row to use the higher-level features as input to combine them with the lower-level features of the RB at the same level. The RCU in Fig. 2 (a) has residual convolutional layers and enriches the network with more parameters. It can adjust and modify the input for MRF. The MRF block adapts the input first by performing a convolution operation, in order to adjust the channel dimension of the feature space corresponding to the higher-level ones with the lower level. Then, the smaller resolution feature maps get upsampled to have same tensor dimensions as the larger ones, after which they are added, as shown in Fig. 2 (b). The goal of the CRP is to gather more context from the feature maps than a normal max pooling layer. Several pooling blocks are concatenated and each block consists of a max-pooling with higher stride length and a convolutional operation. Illustration of CRP with two max-pooling operations is shown in Fig. 2 (c). In the final refinement step, we use the original image as input for an extra path with 3 RCUs, which improves the results.
4 Evaluation
We have done various experiments with different combinations of refinement blocks per multipath, and we have always observed the best results by placing the three blocks sequentially in a row, as shown in Fig. 1.
4.1 Training
For each epoch, 1000 random images are selected from the training set, and the data is augmented by random cropping, vertical flipping, and scaling between 0.7 and 1.3. To find an optimal training method, we have examined the following training variants:
- •
RCN-VOC is trained only on the CRF-refined object contour dataset proposed by Yang et al. [34].
- •
RCN-COCO is pre-trained on a modified COCO dataset, where we have considered only the 20 PASCAL VOC classes and have produced contours. COCO segmentation masks and from those generated contours are not accurate, so we enrich them with additional contours. For this we use our own object contour detector RCN-VOC and set a high threshold to add only confident contour detections.
- •
RCN is pre-trained on the modified COCO and trained on the refined PASCAL VOC.
Training the network for edge detection involves fine-training on the validation and train sets of the BSDS500 dataset. The BSDS contains individual, hand-drawn contours for the same images created by different people. We take the subjective decisions into account of which edge is a desired edge and which is not, by simply using all individual labels, and let the CNN form a compromise. To give an indication of how such training affects the results, we fine-train one of the networks only on the drawings of one single person, called RCN-VOC-1. All trainings and modifications are done in MatConvNet [30].
4.2 Object Contour Detection Evaluation
For evaluation, we use Piotr’s Computer Vision Matlab Toolbox [9], the included Non-Maximum-Suppression (NMS) algorithm for thinning the soft object contour maps and a subset of 1103 images of a CRF-refined PASCAL val2012. We calculate the Precision and Recall (PR) curve for the RCN models, CEDN, HED and COB in Fig.5. In Tab. 5 the Optimal Dataset Scale (ODS), Optimal Image Scale (OIS) and the Average Precision (AP) for the methods are noted. The quantitative analysis reveals that the RCN models significantly perform better in comparison to the other methods on all three metrics. This is also reflected in the visual results, cf. Fig. 3. The RCN-VOC and RCN have upper hand in suppressing the undesired edges, such as inner contours of the objects. At the same time, they also can recognize object contours more clearly. A disadvantage is that the contour predictions are thicker than in CEDN, which is due to the halved resolution owed by the network architecture. Nevertheless, the detection is very robust and the NMS can effectively calculate 1-pixel thinned object contours.
4.3 Edge Detection Evaluation
The results of a quantitative evaluation of the RCN on unseen test images of BSDS500 are represented in the PR-curves in Fig 8. ODS, OIS and AP from methods such as RCN, CED, RCF, COB and HED are listed in Tab. 8. The proposed RCN achieves the state-of-the-art with a higher ODS than recent methods, closely followed by CED. In Fig. 6 results of CED and RCN are visualized for some test images. Careful analysis of the results reveals that RCN detects some relevant edges, cf. inner contours of the snowshoes (1st row), face of the young man (2nd row), snout of the llama (4th row), which CED no longer recognizes. As for the object contour detection task, the disadvantage of the thicker edge predictions persists for the RCN. However, the NMS works more precisely for edge prediction maps from RCN, as the bit depth per pixel is increased from 8 to 16 bits. The difference is evident in the results for background edges in the image of the young man (2nd row), since an absolute maximum of the CED prediction could not be clearly distinguished, RCN edges are thinned more effectively.
5 Conclusion
The strategy, of using the high abstraction capability for object contour and edge detection more effectively than previous methods, has given us very good results in object contour detection and state-of-the-art results in edge detection. Our concept that RefineNet [18] provides a very useful bottom-up multipath refinement architecture for edge detection is supported by these results. With the unconventional training methods, like the pre-training with a modified COCO dataset or by simply using all individual labels for fine-training on BSDS500, we have been able to improve the respective task.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] Arbelaez, P., Maire, M., Fowlkes, C., Malik, J.: Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (5), 898–916 (May 2011). https://doi.org/10.1109/TPAMI.2010.161
- 2[2] Asghari, M.H., Jalali, B.: Physics-inspired image edge detection. In: 2014 IEEE Global Conference on Signal and Information Processing (Global SIP). pp. 293–296 (Dec 2014). https://doi.org/10.1109/Global SIP.2014.7032125
- 3[3] Bertasius, G., Shi, J., Torresani, L.: Deepedge: A multi-scale bifurcated deep network for top-down contour detection. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4380–4389 (June 2015). https://doi.org/10.1109/CVPR.2015.7299067
- 4[4] Bertasius, G., Shi, J., Torresani, L.: High-for-low and low-for-high: Efficient boundary detection from deep object features and its applications to high-level vision. In: 2015 IEEE International Conference on Computer Vision (ICCV). pp. 504–512 (Dec 2015). https://doi.org/10.1109/ICCV.2015.65
- 5[5] Bertasius, G., Shi, J., Torresani, L.: Semantic segmentation with boundary neural fields. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3602–3610 (June 2016). https://doi.org/10.1109/CVPR.2016.392
- 6[6] Canny, J.: A computational approach to edge detection. IEEE Transactions on pattern analysis and machine intelligence (6), 679–698 (1986)
- 7[7] Chen, L., Barron, J.T., Papandreou, G., Murphy, K., Yuille, A.L.: Semantic image segmentation with task-specific edge detection using cnns and a discriminatively trained domain transform. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4545–4554 (June 2016). https://doi.org/10.1109/CVPR.2016.492
- 8[8] Deng, R., Shen, C., Liu, S., Wang, H., Liu, X.: Learning to predict crisp boundaries. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision – ECCV 2018. pp. 570–586. Springer International Publishing, Cham (2018)
