IvaNet: Learning to jointly detect and segment objects with the help of Local Top-Down Modules
Shihua Huang, Lu Wang

TL;DR
IvaNet is a multi-task framework that enhances object detection and segmentation by using local top-down modules to better integrate semantic information across layers, improving robustness.
Contribution
IvaNet introduces local top-down modules for joint detection and segmentation, addressing robustness issues in previous full top-down approaches.
Findings
Demonstrated improved performance on PASCAL VOC
Achieved competitive results on MS COCO
Validated effectiveness of local top-down modules
Abstract
Driven by Convolutional Neural Networks, object detection and semantic segmentation have improved significantly. However, existing methods built on a full top-down module have limited robustness when handling the two tasks simultaneously. To this end, we present a joint multi-task framework, termed IvaNet. Unlike existing methods, IvaNet propagates abstract semantic information backward from higher layers to augment lower layers using local top-down modules. Comparisons against several counterparts on the PASCAL VOC and MS COCO datasets demonstrate the effectiveness of IvaNet.

Table 1: Ablation study of IvaNet variants (mAP/mIoU) at two input resolutions, with speed in FPS.

| Method | 300×300 | 512×512 | FPS (300 / 512) |
|---|---|---|---|
| IvaNet_NO | 70.3/72.9 | 77.1/75.2 | - |
| IvaNet_det | 76.2/- | 79.3/- | 36.5 / 25.5 |
| IvaNet_seg | -/74.2 | -/76.2 | 44.6 / 31.5 |
| IvaNet | 76.2/75.9 | 79.8/76.8 | 32.5 / 24.5 |

Table 2: Comparison with single-task and multi-task models on PASCAL VOC (mAP/mIoU).

| Method | Backbone | VOC2007† | VOC2012 |
|---|---|---|---|
| Detectors: | | | |
| Faster RCNN [10] | VGGNet16 | 73.2/- | 70.4/- |
| SSD300 [5] | VGGNet | 77.2/- | 75.8/- |
| DSSD321 [1] | ResNet101 | 78.6/- | 76.3/- |
| LTD-SSD300 [4] | VGGNet | 79.4/- | 76.7/- |
| SSD512 [5] | VGGNet | 79.8/- | 78.5/- |
| DSSD513 [1] | ResNet101 | 81.5/- | 80.0/- |
| LTD-SSD512 [4] | VGGNet | 81.8/- | 79.7/- |
| Segmentation: | | | |
| FCN [7] | VGGNet | -/- | -/62.2 |
| DeconvNet [2] | VGGNet | -/- | -/69.6 |
| Deeplab-v2 [12] | VGGNet | -/69.0 | -/- |
| GCN+BR [20] | ResNet50 | -/72.3 | -/- |
| GCN+BR [20] | ResNet101 | -/74.7 | -/- |
| Multi-task: | | | |
| BlitzNet300 [3] | ResNet50 | 78.7/75.3* | 76.7/75.7 |
| IvaNet300 | ResNet50 | 78.5/75.5* | 76.2/75.9 |
| BlitzNet512 [3] | ResNet50 | 81.5/75.7* | 79.7/76.7 |
| IvaNet512 | ResNet50 | 81.4/76.9* | 79.8/76.8 |

Table 3: Results on MS COCO.

| Method | minival (AP/mIoU) |
|---|---|
| BlitzNet300 [3] | 29.7/52.8 |
| IvaNet300 | 29.7/55.0 |

**Index Terms:** Object detection, Semantic segmentation, Local Top-Down module, Multi-task framework.
1 Introduction
Object detection and semantic segmentation play pivotal roles in image understanding and serve numerous applications, e.g., automated driving. Driven by Convolutional Neural Networks (CNNs), both tasks have improved significantly. Because of the pooling and strided-convolution layers in a bottom-up network, higher layers have lower resolution. As a result, those layers contain more abstract semantic information about objects, since they have larger receptive fields, but they gradually lose the spatial information that is useful for localization.
In order to better solve object detection and semantic segmentation, which require semantic as well as spatial information, a number of full top-down modules have been proposed to propagate semantic information backward from higher layers into lower layers, where much more detailed information is available [1, 2, 3]. Although these methods have achieved encouraging results, they may introduce a lot of useless information into some lower layers, which may degrade their performance. It can be observed from Fig. 1 that the information about the plant behind the two persons vanishes and is gradually covered by larger objects (the persons) as the receptive field becomes larger. This means that the information propagated backward from some higher layers is meaningless for the lower layers that are used to detect and segment small objects. To this end, a local top-down (LTD) module [4] was recently proposed, and it has demonstrated its effectiveness for the single shot object detector [5].
This paper applies the LTD module to a multi-task framework, called IvaNet, which is designed to detect and segment objects simultaneously and effectively. Specifically, for each lower layer we adopt an LTD module to integrate the information from its two succeeding convolutional layers. In this way, we construct a local top-down network on the basis of a deep bottom-up neural network (an ImageNet pre-trained ResNet50 [6] in this study), and two task-specific heads are built on top of the top-down network. We note that the idea behind our detector is the same as that of SSD [5], while the segmentation head is a simple FCN [7]. Extensive experimental results on the PASCAL VOC and MS COCO datasets show that the local top-down network achieves superior results on semantic segmentation, which demonstrates its effectiveness. Moreover, the proposed multi-task learning improves the performance of each task.
2 Related work
2.1 Object Detection
Object detection aims at locating objects with bounding boxes and classifying them into the corresponding classes. Existing object detectors can be categorized into two branches, R-CNN based [8, 9, 10] and SSD based [5, 1, 4]. R-CNN based methods use a third-party method to pre-select candidate regions; for instance, R-CNN [8] uses Selective Search [11]. Compared with the R-CNN family, the Single Shot Detector (SSD) [5] is fast and robust to multi-scale object detection. Some variants of SSD improve robustness by introducing abstract semantic information from higher layers into lower ones via a top-down network. Instead of using a full top-down module, which introduces much useless information as in DSSD [1], LTD-SSD [4] proposes a local top-down module and achieves better results. Specifically, each prediction layer is integrated only with the upsampled features from its two succeeding layers.
2.2 Semantic Segmentation
The task of semantic segmentation is to classify each pixel of the input image into the corresponding class. FCN [7] is one of the pioneers that extended the convolutional model used for image-level classification to per-pixel classification by replacing all fully connected layers with convolutional layers. Instead of using a single bilinear interpolation layer to upsample the segmentation results to the original size, DeconvNet [2] opts for a deep learnable deconvolution network. DeepLab V2 [12] proposes an Atrous Spatial Pyramid Pooling module that builds a spatial pyramid without changing the resolution of the feature maps, while PSPNet [13] adopts a pyramid pooling module to aggregate contextual information from different regions.
2.3 Multi-task learning
A number of methods for multi-task learning have emerged after UberNet [14], which handles 7 computer vision tasks simultaneously with a single complex model. Mask R-CNN [15] augments Faster R-CNN with an instance segmentation prediction branch and shows compelling results on object detection and instance segmentation. Unlike Mask R-CNN, BlitzNet [3] is an SSD-based multi-task framework for object detection and semantic segmentation that is considerably faster.
3 IvaNet
3.1 Architecture
The overall architecture is shown in Fig. 2. IvaNet consists of a shared convolutional feature extraction network and two task-specific heads on top of it.
Shared Network: We use ImageNet pre-trained layers from ResNet50 [6] to construct a base bottom-up network. Specifically, all parameters after block4 are dropped. Furthermore, the same local top-down (LTD) module presented in LTD-SSD [4] is adopted to construct a top-down feature pyramid. We note that the output channel of each LTD module is restricted to 384 by a convolutional layer, which differs from LTD-SSD.
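The difference between a local and a full top-down pathway can be illustrated purely in terms of which pyramid levels feed each level. The sketch below is not the authors' implementation: the function names and the 0-based level indexing are hypothetical, and it models only connectivity (each level fused with at most its two succeeding levels, as described for LTD-SSD [4]), not the actual upsampling or convolution.

```python
def local_top_down_sources(num_levels, span=2):
    """Levels fused into each pyramid level by a local top-down module.

    Level 0 is the lowest (highest-resolution) layer. Each level receives
    its own features plus those of at most `span` succeeding levels.
    """
    last = num_levels - 1
    return {i: list(range(i, min(i + span, last) + 1))
            for i in range(num_levels)}


def full_top_down_sources(num_levels):
    """Levels that (directly or recursively) reach each level in a full
    top-down pathway: the level itself and every higher level."""
    return {i: list(range(i, num_levels)) for i in range(num_levels)}


if __name__ == "__main__":
    print(local_top_down_sources(5))
    print(full_top_down_sources(5))
```

With five levels, the lowest level of the local variant only sees levels 0 to 2, whereas in the full variant it is influenced by all five, which is the source of the "useless information" the text refers to.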
Object Detection Head: The Object Detection Head contains a number of Bounding Box Prediction (BBox Prediction) branches with identical network structures. They operate on the feature pyramid to detect multi-scale objects efficiently. Each BBox Prediction contains a pair of detection-specific layers, one for classification and one for location regression. Both are convolutional layers, with $A \times C$ and $A \times 4$ output channels, respectively, where $C$ is the number of object classes (e.g., 21 for the PASCAL VOC dataset, one of which is background; this notation is used consistently in the following unless otherwise specified), $A$ denotes the number of default anchors in each cell, and 4 represents the two coordinates of each anchor. Furthermore, we adopt hard negative example mining to balance the ratio between positive and negative examples. Finally, Non-Maximum Suppression (NMS) is employed as the post-processing step to eliminate redundant detection results.
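As an illustration of the post-processing step, a minimal greedy NMS can be sketched as follows. This is a generic textbook version rather than the paper's code; the box format `(x1, y1, x2, y2)` and the IoU threshold of 0.5 are assumptions, since the text does not state them.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score and keep a box only if
    it does not overlap an already kept box by more than the threshold.
    The threshold value is an assumption, not taken from the paper."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))  # indices of the boxes that survive
```

In a multi-class detector this routine would typically be run once per class on the boxes whose class score exceeds a confidence cutoff.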
Semantic Segmentation Head: The semantic segmentation head includes a Pyramid Convolutional Module (PCM) and a mask prediction branch. Instead of using different pooling kernel sizes and strides to obtain multi-scale feature maps, as the Pyramid Pooling Module in PSPNet [13] does, we use the pre-computed feature pyramid from the local top-down network. The pyramid features are upsampled to a pre-defined size and concatenated. The concatenation is then fed to a convolutional layer before the mask prediction branch uses it to predict the semantic mask. The mask prediction branch has only one convolutional layer, whose output channel is set to the number of classes.
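The upsample-and-concatenate step of the PCM can be sketched on plain nested lists. This is only an illustration of the data flow: the helper names are hypothetical, and nearest-neighbour interpolation is an assumption made for simplicity, since the text does not say which upsampling method is used.

```python
def upsample_nearest(channel, out_h, out_w):
    """Nearest-neighbour upsampling of one 2D channel (list of rows)."""
    in_h, in_w = len(channel), len(channel[0])
    return [[channel[y * in_h // out_h][x * in_w // out_w]
             for x in range(out_w)]
            for y in range(out_h)]


def pyramid_concat(pyramid, out_h, out_w):
    """Upsample every channel of every pyramid level to (out_h, out_w)
    and concatenate the results along the channel axis.

    `pyramid` is a list of feature maps; each feature map is a list of
    2D channels. The result is a single list of 2D channels.
    """
    fused = []
    for feature_map in pyramid:
        for channel in feature_map:
            fused.append(upsample_nearest(channel, out_h, out_w))
    return fused


if __name__ == "__main__":
    level0 = [[[1, 2], [3, 4]]]   # one channel, 2x2
    level1 = [[[5]]]              # one channel, 1x1
    fused = pyramid_concat([level0, level1], 4, 4)
    print(len(fused), fused[0][0], fused[1][0])
```

The fused tensor would then pass through a convolutional layer and the mask prediction branch, as described above.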
3.2 Objective
For detection, we use the same objective as presented in SSD [5]:

$$L_{det}(c, l, g) = \frac{1}{N}\left(L_{cls}(c) + L_{loc}(l, g)\right)$$

where $L_{cls}$ is a classification loss defined as the softmax loss over the multi-class scores $c$, the localization loss $L_{loc}$ is a Smooth L1 loss between the parameters of the predicted box $l$ and the ground-truth box $g$, and $N$ denotes the number of samples, with the same meaning in the following formula. The segmentation loss is the cross-entropy between the predicted and the target class distributions of pixels:

$$L_{seg}(\hat{y}, y) = -\frac{1}{N}\sum_{i \in S} y_i \log \hat{y}_i$$

where $S$ is a set of samples, $y$ is the target, and $\hat{y}$ is the prediction. The final objective is defined as

$$L = L_{det} + \lambda L_{seg}$$

where $\lambda$ is a hyper-parameter that controls the relative importance of the segmentation loss compared to the detection loss; it is set to 1.
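To make the objective concrete, the three terms can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the training code: the function names are hypothetical, the detection terms follow the standard SSD formulation (softmax loss plus Smooth L1, normalised by the number of matched anchors), and averaging the segmentation loss over pixels is an assumption.

```python
import math


def smooth_l1(pred, target):
    """Smooth L1 loss summed over the coordinates of one box."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


def softmax_cross_entropy(logits, label):
    """Negative log-probability of `label` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def detection_loss(cls_logits, cls_labels, box_preds, box_targets, num_matched):
    """(classification + localisation) / N, with N the matched anchors."""
    l_cls = sum(softmax_cross_entropy(z, y)
                for z, y in zip(cls_logits, cls_labels))
    l_loc = sum(smooth_l1(p, g) for p, g in zip(box_preds, box_targets))
    return (l_cls + l_loc) / max(num_matched, 1)


def segmentation_loss(pixel_logits, pixel_labels):
    """Per-pixel cross-entropy, averaged over pixels (an assumption)."""
    losses = [softmax_cross_entropy(z, y)
              for z, y in zip(pixel_logits, pixel_labels)]
    return sum(losses) / len(losses)


def total_loss(l_det, l_seg, seg_weight=1.0):
    """Final objective: detection loss plus weighted segmentation loss.
    The default weight of 1 follows the value stated in the text."""
    return l_det + seg_weight * l_seg
```

For example, `total_loss(detection_loss(...), segmentation_loss(...))` with the default weight reproduces the equal weighting of the two tasks described above.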
3.3 Implementation Details
We implement our model with TensorFlow [16], and it will be available at https://github.com/Hshihua/IvaNet. The same data augmentation methods as in SSD [5] are adopted, including random crop, horizontal flip, and so on. For all experiments, the Adam optimizer [17] is used to train IvaNet, and the initial learning rate is set to $10^{-4}$, which is decreased twice by a factor of 10. Moreover, the mini-batch size is set to 32 or 16 when the resolution of the input image is 300 or 512, respectively.
4 Experiments
In this section, we first briefly introduce the datasets and metrics used in this paper. Afterwards, the effectiveness of the proposed IvaNet is validated by comparisons with several counterparts on various datasets. Finally, we analyze possible reasons behind our failure cases.
4.1 Datasets and metrics
PASCAL VOC. VOC2007 and VOC2012 are two widely used datasets for detection and segmentation. Both datasets have thousands of images over 20 object classes. Since only a subset of the VOC2012 images is annotated with semantic masks, which limits semantic segmentation evaluation, we augment them with the extra annotations provided by [18], denoted as VOC2012 train-aug.
Microsoft COCO. The MS COCO dataset [19] includes 80 categories of objects for object detection and instance segmentation. There are hundreds of thousands of annotated images. To obtain semantic segmentation annotations from the given instance segmentation annotations, we use the tool provided by Nikita Dvornik et al. [3].
The quality of predicted segmentation masks is measured with mean Intersection over Union (mIoU) on all datasets, while detection results on PASCAL VOC and MS COCO are evaluated with mean Average Precision (mAP) and AP averaged over IoU thresholds 0.5:0.95 (AP for short), respectively.
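For readers unfamiliar with the segmentation metric, mIoU over flattened label maps can be sketched as below. This is a simplified illustration, not the official evaluation code: the function name is hypothetical, and the handling of classes that appear in neither map (they are skipped) and of ignore labels (not handled) are assumptions.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union between two flattened label maps.

    `pred` and `target` are equal-length sequences of integer class ids
    in [0, num_classes). Classes absent from both maps are skipped.
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            inter[p] += 1
            union[p] += 1
        else:
            # A mismatched pixel counts towards the union of both classes.
            union[p] += 1
            union[t] += 1
    ious = [inter[c] / union[c] for c in range(num_classes) if union[c] > 0]
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

In the small example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.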
4.2 Results on the PASCAL VOC
For all experiments on the PASCAL VOC datasets, the maximum number of training iterations is set to 65k and 75k for models with input sizes of 300×300 and 512×512, respectively. The learning rate is first decreased at 35k and 45k steps, respectively, and in both cases it is decreased once more 15k steps later.
Ablation Study. This part demonstrates the effectiveness of the LTD modules and validates that a multi-task model is better than a single-task one. We build three variants of IvaNet that ablate the LTD modules, the segmentation head, and the detection head, called IvaNet_NO, IvaNet_det, and IvaNet_seg, respectively. From Tab. 1, we can see that IvaNet without LTD modules performs significantly worse on both tasks. Besides, IvaNet improves over IvaNet_det and IvaNet_seg, with maximum improvements of 0.5% and 1.7%, respectively. Furthermore, adding the segmentation head alongside the detection head costs IvaNet only a small amount of extra time compared with IvaNet_det.
Comparison. As shown in Tab. 2, the proposed IvaNet achieves superior or comparable results when compared to several single-task models. We also compare IvaNet with BlitzNet [3], an existing multi-task model that exploits a full top-down module. IvaNet has nearly the same performance as BlitzNet on the detection task, but it achieves much better results on the segmentation task, outperforming BlitzNet by 1.1%.
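The step schedule described here amounts to a piecewise-constant learning rate. The sketch below is one way to write it, assuming the initial rate of 1e-4 and factor of 10 from the implementation details; the function name is hypothetical, and whether the drop applies exactly at the boundary step or just after it is an assumption.

```python
def learning_rate(step, first_drop, base_lr=1e-4, gap=15_000, factor=10.0):
    """Piecewise-constant schedule with two drops.

    The rate is divided by `factor` at `first_drop` steps and again
    `gap` steps later. For the PASCAL VOC runs, `first_drop` would be
    35_000 (300x300 input) or 45_000 (512x512 input).
    """
    lr = base_lr
    if step >= first_drop:
        lr /= factor
    if step >= first_drop + gap:
        lr /= factor
    return lr


if __name__ == "__main__":
    for step in (0, 35_000, 50_000, 64_999):
        print(step, learning_rate(step, first_drop=35_000))
```

Under these assumptions the 300×300 model trains at 1e-4 until 35k steps, 1e-5 until 50k steps, and 1e-6 for the remaining iterations up to 65k.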
4.3 Results on the MS COCO
All models are trained on trainval35k for 700k iterations, and the learning rate is decreased at 400k and 500k steps. Unlike PASCAL VOC, all training images are annotated with both bounding boxes and semantic masks, the latter derived from the public instance masks. Tab. 3 shows that the proposed IvaNet significantly outperforms BlitzNet on the segmentation task while keeping comparable results on the detection task.
4.4 Visual Illustration
From the visual results in Fig. 3, we can observe that BlitzNet is ineffective at segmenting small objects. We argue that the reason for such failures is that much meaningless information is integrated into lower layers by the full top-down module, while IvaNet avoids this and performs better.
4.5 Limitation
Although the proposed IvaNet achieves compelling results in many cases, it still has some limitations. For instance, IvaNet can detect some objects but fail to segment them, as shown in Fig. 4. We argue that the reason behind such failure cases is the limited context introduced by our model, as semantic segmentation requires more context than object detection.
5 Conclusion
We present IvaNet, a multi-task framework for solving object detection and semantic segmentation effectively and efficiently. With the help of LTD modules, IvaNet achieves superior or comparable results on the PASCAL VOC and MS COCO datasets. Because of the limited context of IvaNet, we plan to adopt atrous convolutional layers in the future.
References
- [1] Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, and Alexander C Berg, “DSSD: Deconvolutional single shot detector,” arXiv preprint arXiv:1701.06659, 2017.
- [2] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han, “Learning deconvolution network for semantic segmentation,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1520–1528.
- [3] Nikita Dvornik, Konstantin Shmelkov, Julien Mairal, and Cordelia Schmid, “BlitzNet: A real-time deep network for scene understanding,” in ICCV 2017-International Conference on Computer Vision, 2017, p. 11.
- [4] Shihua Huang, Lu Wang, Peiyu Yang, and Qingxu Deng, “A local top-down module for object detection with multi-scale features,” in Chinese Conference on Pattern Recognition and Computer Vision (PRCV). Springer, 2018, pp. 65–77.
- [5] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg, “SSD: Single shot multibox detector,” in European conference on computer vision. Springer, 2016, pp. 21–37.
- [6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- [7] Jonathan Long, Evan Shelhamer, and Trevor Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
- [8] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
