Feature Selective Anchor-Free Module for Single-Shot Object Detection

Chenchen Zhu; Yihui He; Marios Savvides

arXiv:1903.00621·cs.CV·March 5, 2019


TL;DR

This paper introduces the FSAF module, an anchor-free component for single-shot object detection that improves accuracy and speed by dynamic feature selection and can work alongside anchor-based methods.

Contribution

The paper proposes a novel feature selective anchor-free module that enhances single-shot detectors with dynamic feature assignment and joint anchor-based integration.

Findings

01

FSAF outperforms anchor-based methods in accuracy and speed.

02

Joint training with FSAF and anchor-based branches improves detection performance.

03

Achieves 44.6% mAP on COCO, surpassing existing single-shot detectors.

Abstract

We motivate and present feature selective anchor-free (FSAF) module, a simple and effective building block for single-shot object detectors. It can be plugged into single-shot detectors with feature pyramid structure. The FSAF module addresses two limitations brought up by the conventional anchor-based detection: 1) heuristic-guided feature selection; 2) overlap-based anchor sampling. The general concept of the FSAF module is online feature selection applied to the training of multi-level anchor-free branches. Specifically, an anchor-free branch is attached to each level of the feature pyramid, allowing box encoding and decoding in the anchor-free manner at an arbitrary level. During training, we dynamically assign each instance to the most suitable feature level. At the time of inference, the FSAF module can work jointly with anchor-based branches by outputting predictions in parallel.…

Tables

Table 1: Ablative experiments for the FSAF module on the COCO minival. ResNet-50 is the backbone network for all experiments in this table. We study the effect of anchor-free branches, heuristic feature selection, and online feature selection.

Method	Anchor-based branches	Anchor-free branches, heuristic feature selection Eqn. (3)	Anchor-free branches, online feature selection Eqn. (2)	AP	AP₅₀	AP₇₅	AP_S	AP_M	AP_L
RetinaNet	✓			35.7	54.7	38.5	19.5	39.9	47.5
Ours		✓		34.7	54.0	36.4	19.0	39.0	45.8
Ours			✓	35.9	55.0	37.9	19.8	39.6	48.2
Ours	✓	✓		36.1	55.6	38.7	19.8	39.7	48.9
Ours	✓		✓	37.2	57.2	39.4	21.0	41.2	49.7

Table 2: Detection accuracy and inference latency with different backbone networks on the COCO minival. AB: Anchor-based branches. R: ResNet. X: ResNeXt.

Backbone	Method	AP	AP₅₀	Runtime (ms/im)
R-50	RetinaNet	35.7	54.7	131
R-50	Ours (FSAF)	35.9	55.0	107
R-50	Ours (AB+FSAF)	37.2	57.2	138
R-101	RetinaNet	37.7	57.2	172
R-101	Ours (FSAF)	37.9	58.0	148
R-101	Ours (AB+FSAF)	39.3	59.2	180
X-101	RetinaNet	39.8	59.5	356
X-101	Ours (FSAF)	41.0	61.5	288
X-101	Ours (AB+FSAF)	41.6	62.4	362

Table 3: Object detection results of our best single model with the FSAF module vs. state-of-the-art single-shot and multi-shot detectors on the COCO test-dev.

Multi-shot detectors
Method	Backbone	AP	AP₅₀	AP₇₅	AP_S	AP_M	AP_L
CoupleNet [42]	ResNet-101	34.4	54.8	37.2	13.4	38.1	50.8
Faster R-CNN+++ [28]	ResNet-101	34.9	55.7	37.4	15.6	38.7	50.9
Faster R-CNN w/ FPN [21]	ResNet-101	36.2	59.1	39.0	18.2	39.0	48.2
Regionlets [35]	ResNet-101	39.3	59.8	n/a	21.7	43.7	50.9
Fitness NMS [31]	ResNet-101	41.8	60.9	44.9	21.5	45.0	57.5
Cascade R-CNN [3]	ResNet-101	42.8	62.1	46.3	23.7	45.5	55.2
Deformable R-FCN [4]	Aligned-Inception-ResNet	37.5	58.0	n/a	19.4	40.1	52.5
Soft-NMS [2]	Aligned-Inception-ResNet	40.9	62.8	n/a	23.3	43.6	53.3
Deformable R-FCN + SNIP [30]	DPN-98	45.7	67.3	51.1	29.3	48.8	57.1
Single-shot detectors
Method	Backbone	AP	AP₅₀	AP₇₅	AP_S	AP_M	AP_L
YOLOv2 [27]	DarkNet-19	21.6	44.0	19.2	5.0	22.4	35.5
SSD513 [24]	ResNet-101	31.2	50.4	33.3	10.2	34.5	49.8
DSSD513 [8]	ResNet-101	33.2	53.3	35.2	13.0	35.4	51.1
RefineDet512 [37] (single-scale)	ResNet-101	36.4	57.5	39.5	16.6	39.9	51.4
RefineDet [37] (multi-scale)	ResNet-101	41.8	62.9	45.7	25.6	45.1	54.1
RetinaNet800 [22]	ResNet-101	39.1	59.1	42.3	21.8	42.7	50.2
GHM800 [18]	ResNet-101	39.9	60.8	42.5	20.3	43.6	54.1
Ours800 (single-scale)	ResNet-101	40.9	61.5	44.0	24.0	44.2	51.3
Ours (multi-scale)	ResNet-101	42.8	63.1	46.5	27.8	45.5	53.2
CornerNet511 [17] (single-scale)	Hourglass-104	40.5	56.5	43.1	19.4	42.7	53.9
CornerNet [17] (multi-scale)	Hourglass-104	42.1	57.8	45.3	20.8	44.8	56.7
GHM800 [18]	ResNeXt-101	41.6	62.8	44.2	22.3	45.1	55.3
Ours800 (single-scale)	ResNeXt-101	42.9	63.8	46.3	26.6	46.2	52.7
Ours (multi-scale)	ResNeXt-101	44.6	65.2	48.6	29.7	47.1	54.6

Equations

$$L_{FL}^{I}(l)=\frac{1}{N(b_{e}^{l})}\sum_{i,j\in b_{e}^{l}}FL(l,i,j),\qquad L_{IoU}^{I}(l)=\frac{1}{N(b_{e}^{l})}\sum_{i,j\in b_{e}^{l}}IoU(l,i,j)\tag{1}$$

$$l^{*}=\arg\min_{l}\;L_{FL}^{I}(l)+L_{IoU}^{I}(l)\tag{2}$$

$$l^{\prime}=\lfloor l_{0}+\log_{2}(\sqrt{wh}/224)\rfloor\tag{3}$$


Taxonomy

Methods: Average Pooling · ResNeXt Block · Random Horizontal Flip · Weight Decay · SGD with Momentum · Non Maximum Suppression · FSAF · Grouped Convolution · Feature Pyramid Network · Bottleneck Residual Block

Full text

Feature Selective Anchor-Free Module for Single-Shot Object Detection

Chenchen Zhu Yihui He Marios Savvides

Carnegie Mellon University

{chenchez, he2, marioss}@andrew.cmu.edu

Abstract

We motivate and present feature selective anchor-free (FSAF) module, a simple and effective building block for single-shot object detectors. It can be plugged into single-shot detectors with feature pyramid structure. The FSAF module addresses two limitations brought up by the conventional anchor-based detection: 1) heuristic-guided feature selection; 2) overlap-based anchor sampling. The general concept of the FSAF module is online feature selection applied to the training of multi-level anchor-free branches. Specifically, an anchor-free branch is attached to each level of the feature pyramid, allowing box encoding and decoding in the anchor-free manner at an arbitrary level. During training, we dynamically assign each instance to the most suitable feature level. At the time of inference, the FSAF module can work jointly with anchor-based branches by outputting predictions in parallel. We instantiate this concept with simple implementations of anchor-free branches and online feature selection strategy. Experimental results on the COCO detection track show that our FSAF module performs better than anchor-based counterparts while being faster. When working jointly with anchor-based branches, the FSAF module robustly improves the baseline RetinaNet by a large margin under various settings, while introducing nearly free inference overhead. And the resulting best model can achieve a state-of-the-art 44.6% mAP, outperforming all existing single-shot detectors on COCO.

1 Introduction

Object detection is an important task in the computer vision community. It serves as a prerequisite for various downstream vision applications such as instance segmentation [12], facial analysis [1, 39], autonomous driving cars [6, 20], and video analysis [25, 33]. The performance of object detectors has been dramatically improved thanks to the advance of deep convolutional neural networks [16, 29, 13, 34] and well-annotated datasets [7, 23].

One challenging problem for object detection is scale variation. To achieve scale invariance, state-of-the-art detectors construct feature pyramids or multi-level feature towers [24, 8, 21, 22, 19, 38], and multiple scale levels of feature maps generate predictions in parallel. Besides, anchor boxes can further handle scale variation [24, 28]. Anchor boxes are designed to discretize the continuous space of all possible instance boxes into a finite number of boxes with predefined locations, scales, and aspect ratios, and instance boxes are matched to anchor boxes based on the Intersection-over-Union (IoU) overlap. When integrated with feature pyramids, large anchor boxes are typically associated with upper feature maps, and small anchor boxes with lower feature maps (see Figure 2). This is based on the heuristic that upper feature maps have more semantic information suitable for detecting big instances, whereas lower feature maps have more fine-grained details suitable for detecting small instances [11]. The design of feature pyramids integrated with anchor boxes has achieved good performance on object detection benchmarks [7, 23, 9].

However, this design has two limitations: 1) heuristic-guided feature selection; 2) overlap-based anchor sampling. During training, each instance is always matched to the closest anchor box(es) according to IoU overlap. And anchor boxes are associated with a certain level of feature map by human-defined rules, such as box size. Therefore, the selected feature level for each instance is purely based on ad-hoc heuristics. For example, a car instance with size $50\times 50$ pixels and another similar car instance with size $60\times 60$ pixels may be assigned to two different feature levels, whereas another $40\times 40$ car instance may be assigned to the same level as the $50\times 50$ instance, as illustrated in Figure 2. In other words, the anchor matching mechanism is inherently heuristic-guided. This leads to a major flaw that the selected feature level to train each instance may not be optimal.

We propose a simple and effective approach named the feature selective anchor-free (FSAF) module to address these two limitations simultaneously. Our motivation is to let each instance freely select the best feature level to optimize the network, so there should be no anchor boxes to constrain the feature selection in our module. Instead, we encode the instances in an anchor-free manner to learn the parameters for classification and regression. The general concept is presented in Figure 3. An anchor-free branch is built per level of the feature pyramid, independent of the anchor-based branch. Similar to the anchor-based branch, it consists of a classification subnet and a regression subnet (not shown in the figure). An instance can be assigned to an arbitrary level of the anchor-free branch. During training, we dynamically select the most suitable feature level for each instance based on the instance content instead of just the size of the instance box. The selected feature level then learns to detect the assigned instances. At inference, the FSAF module can run independently or jointly with anchor-based branches. Our FSAF module is agnostic to the backbone network and can be applied to single-shot detectors with a feature pyramid structure. Additionally, the anchor-free branches and the online feature selection can be instantiated in various ways. In this work, we keep the implementation of our FSAF module simple so that its computational cost is marginal compared to the whole network.

Extensive experiments on the COCO [23] object detection benchmark confirm the effectiveness of our method. The FSAF module by itself outperforms anchor-based counterparts and also runs faster. When working jointly with anchor-based branches, the FSAF module consistently improves the strong baselines by large margins across various backbone networks, while introducing minimal computational cost. In particular, we improve RetinaNet using ResNeXt-101 [34] by 1.8% AP with only 6ms additional inference latency. Additionally, our final detector achieves a state-of-the-art 44.6% mAP when multi-scale testing is employed, outperforming all existing single-shot detectors on COCO.

2 Related Work

Recent object detectors often use a feature pyramid or multi-level feature tower as a common structure. SSD [24] first proposed to predict class scores and bounding boxes from multiple feature scales. FPN [21] and DSSD [8] proposed to enhance low-level features with high-level semantic feature maps at all scales. RetinaNet [22] addressed the class imbalance issue of multi-level dense detectors with focal loss. DetNet [19] designed a novel backbone network to maintain high spatial resolution in upper pyramid levels. However, they all use pre-defined anchor boxes to encode and decode object instances. Other works address the scale variation differently. Zhu et al. [41] enhanced the anchor design for small objects. He et al. [14] modeled the bounding box as a Gaussian distribution for improved localization.

The idea of anchor-free detection is not new. DenseBox [15] first proposed a unified end-to-end fully convolutional framework that directly predicted bounding boxes. UnitBox [36] proposed an Intersection over Union (IoU) loss function for better box regression. Zhong et al. [40] proposed an anchor-free region proposal network to find text in various scales, aspect ratios, and orientations. Recently, CornerNet [17] proposed to detect an object bounding box as a pair of corners, leading to the best single-shot detector. SFace [32] proposed to integrate the anchor-based method and the anchor-free method. However, they still adopt heuristic feature selection strategies.

3 Feature Selective Anchor-Free Module

In this section we instantiate our feature selective anchor-free (FSAF) module by showing how to apply it to single-shot detectors with feature pyramids, such as SSD [24], DSSD [8] and RetinaNet [22]. Without loss of generality, we apply the FSAF module to the state-of-the-art RetinaNet [22] and demonstrate our design from the following aspects: 1) how to create the anchor-free branches in the network (3.1); 2) how to generate supervision signals for anchor-free branches (3.2); 3) how to dynamically select the feature level for each instance (3.3); 4) how to jointly train and test anchor-free and anchor-based branches (3.4).

3.1 Network Architecture

From the network’s perspective, our FSAF module is surprisingly simple. Figure 4 illustrates the architecture of the RetinaNet [22] with the FSAF module. In brief, RetinaNet is composed of a backbone network (not shown in the figure) and two task-specific subnets. The feature pyramid is constructed from the backbone network with levels from $P_{3}$ through $P_{7}$, where $l$ is the pyramid level and $P_{l}$ has $1/2^{l}$ of the resolution of the input image. Only three levels are shown for simplicity. Each level of the pyramid is used for detecting objects at a different scale. To do this, a classification subnet and a regression subnet are attached to $P_{l}$. They are both small fully convolutional networks. The classification subnet predicts the probability of objects at each spatial location for each of the $A$ anchors and $K$ object classes. The regression subnet predicts the 4-dimensional class-agnostic offset from each of the $A$ anchors to a nearby instance, if one exists.

On top of the RetinaNet, our FSAF module introduces only two additional conv layers per pyramid level, shown as the dashed feature maps in Figure 4. These two layers are responsible for the classification and regression predictions in the anchor-free branch respectively. To be more specific, a $3\times 3$ conv layer with $K$ filters is attached to the feature map in the classification subnet followed by the sigmoid function, in parallel with the one from the anchor-based branch. It predicts the probability of objects at each spatial location for $K$ object classes. Similarly, a $3\times 3$ conv layer with four filters is attached to the feature map in the regression subnet followed by the ReLU [26] function. It is responsible for predicting the box offsets encoded in an anchor-free manner. To this end the anchor-free and anchor-based branches work jointly in a multi-task style, sharing the features in every pyramid level.

3.2 Ground-truth and Loss

Given an object instance, we know its class label $k$ and bounding box coordinates $b=[x,y,w,h]$, where $(x,y)$ is the center of the box, and $w,h$ are the box width and height respectively. The instance can be assigned to an arbitrary feature level $P_{l}$ during training. We define the projected box $b_{p}^{l}=[x_{p}^{l},y_{p}^{l},w_{p}^{l},h_{p}^{l}]$ as the projection of $b$ onto the feature pyramid $P_{l}$, i.e. $b_{p}^{l}=b/2^{l}$. We also define the effective box $b_{e}^{l}=[x_{e}^{l},y_{e}^{l},w_{e}^{l},h_{e}^{l}]$ and the ignoring box $b_{i}^{l}=[x_{i}^{l},y_{i}^{l},w_{i}^{l},h_{i}^{l}]$ as proportional regions of $b_{p}^{l}$ controlled by constant scale factors $\epsilon_{e}$ and $\epsilon_{i}$ respectively, i.e. $x_{e}^{l}=x_{p}^{l},y_{e}^{l}=y_{p}^{l},w_{e}^{l}=\epsilon_{e}w_{p}^{l},h_{e}^{l}=\epsilon_{e}h_{p}^{l}$, $x_{i}^{l}=x_{p}^{l},y_{i}^{l}=y_{p}^{l},w_{i}^{l}=\epsilon_{i}w_{p}^{l},h_{i}^{l}=\epsilon_{i}h_{p}^{l}$. We set $\epsilon_{e}=0.2$ and $\epsilon_{i}=0.5$. An example of ground-truth generation for a car instance is illustrated in Figure 5.
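As a rough illustration of the box definitions above, the following Python sketch computes the projected, effective, and ignoring boxes for one instance on one pyramid level. The function name `project_boxes` and the example box are hypothetical; only the projection $b/2^{l}$ and the factors $\epsilon_{e}=0.2$, $\epsilon_{i}=0.5$ come from the text.

```python
def project_boxes(box, level, eps_e=0.2, eps_i=0.5):
    """Return (b_p, b_e, b_i) for a center-format box (x, y, w, h).

    b_p is the box divided by 2**level; b_e and b_i keep the same
    center and shrink the width and height by eps_e and eps_i.
    """
    stride = 2 ** level
    x_p, y_p, w_p, h_p = (v / stride for v in box)
    b_p = (x_p, y_p, w_p, h_p)
    b_e = (x_p, y_p, eps_e * w_p, eps_e * h_p)
    b_i = (x_p, y_p, eps_i * w_p, eps_i * h_p)
    return b_p, b_e, b_i


# Hypothetical 160x80 instance centered at (128, 64), projected onto P_3.
b_p, b_e, b_i = project_boxes((128.0, 64.0, 160.0, 80.0), level=3)
print(b_p)  # projected box on the feature map
print(b_e)  # effective (positive) box
print(b_i)  # ignoring box
```

All three boxes share one center, so the effective box is always nested inside the ignoring box, which is nested inside the projected box.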

Classification Output: The ground-truth for the classification output is $K$ maps, with each map corresponding to one class. The instance affects the $k$-th ground-truth map in three ways. First, the effective box $b_{e}^{l}$ region is the positive region filled by ones, shown as the white box in the “car” class map, indicating the existence of the instance. Second, the ignoring box excluding the effective box ($b_{i}^{l}-b_{e}^{l}$) is the ignoring region shown as the grey area, which means that the gradients in this area are not propagated back to the network. Third, the ignoring boxes in adjacent feature levels ($b_{i}^{l-1}$, $b_{i}^{l+1}$) are also ignoring regions, if they exist. Note that if the effective boxes of two instances overlap in one level, the smaller instance has higher priority. The remaining region of the ground-truth map is the negative (black) area filled by zeros, indicating the absence of objects. Focal loss [22] is applied for supervision with hyperparameters $\alpha=0.25$ and $\gamma=2.0$. The total classification loss of anchor-free branches for an image is the summation of the focal loss over all non-ignoring regions, normalized by the total number of pixels inside all effective box regions.
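The labeling of one class map can be sketched as follows for a single instance on a single level. This is a simplified illustration, not the authors' implementation: the floor/ceil rounding to pixel bounds, the use of `-1` as the ignore marker, and the helper names are assumptions, and the adjacent-level ignoring regions and the smaller-instance priority rule are left out.

```python
import math

POSITIVE, NEGATIVE, IGNORE = 1, 0, -1  # IGNORE marker value is an assumption


def box_to_pixel_range(box):
    """Center-format (x, y, w, h) -> half-open pixel bounds (x0, y0, x1, y1).

    The floor/ceil rounding is an assumption; the text does not specify it.
    """
    x, y, w, h = box
    return (math.floor(x - w / 2), math.floor(y - h / 2),
            math.ceil(x + w / 2), math.ceil(y + h / 2))


def class_map(height, width, b_e, b_i):
    """Ground-truth map for the instance's class on one pyramid level."""
    gt = [[NEGATIVE] * width for _ in range(height)]
    # Paint the ignoring box first, then overwrite its core with positives.
    for value, box in ((IGNORE, b_i), (POSITIVE, b_e)):
        x0, y0, x1, y1 = box_to_pixel_range(box)
        for row in range(max(y0, 0), min(y1, height)):
            for col in range(max(x0, 0), min(x1, width)):
                gt[row][col] = value
    return gt


# Effective and ignoring boxes of a hypothetical 20x10 projected box at (16, 8).
gt = class_map(16, 32, b_e=(16.0, 8.0, 4.0, 2.0), b_i=(16.0, 8.0, 10.0, 5.0))
num_pos = sum(row.count(POSITIVE) for row in gt)
num_ign = sum(row.count(IGNORE) for row in gt)
print(num_pos, num_ign)
```

In a loss computation, pixels marked `IGNORE` would be masked out, and `num_pos` summed over all instances would serve as the normalizer described in the text.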

Box Regression Output: The ground-truth for the regression output is four offset maps agnostic to classes. The instance only affects the $b_{e}^{l}$ region on the offset maps. For each pixel location $(i,j)$ inside $b_{e}^{l}$, we represent the projected box $b_{p}^{l}$ as a 4-dimensional vector $\mathbf{d}_{i,j}^{l}=[d_{t_{i,j}}^{l},d_{l_{i,j}}^{l},d_{b_{i,j}}^{l},d_{r_{i,j}}^{l}]$, where $d_{t}^{l}$, $d_{l}^{l}$, $d_{b}^{l}$, $d_{r}^{l}$ are the distances between the current pixel location $(i,j)$ and the top, left, bottom, and right boundaries of $b_{p}^{l}$, respectively. Then the 4-dimensional vector at location $(i,j)$ across the four offset maps is set to $\mathbf{d}_{i,j}^{l}/S$, with each map corresponding to one dimension. $S$ is a normalization constant, and we empirically choose $S=4.0$ in this work. Locations outside the effective box are the grey area where gradients are ignored. IoU loss [36] is adopted for optimization. The total regression loss of anchor-free branches for an image is the average of the IoU loss over all effective box regions.
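A minimal sketch of the regression target at one pixel is given below. It assumes that $i$ indexes the vertical axis and $j$ the horizontal axis, which is inferred from the decoding formulas rather than stated explicitly, and that pixel indices are used directly as coordinates; the function name is hypothetical.

```python
def offset_target(i, j, b_p, S=4.0):
    """Normalized (top, left, bottom, right) distances from pixel (i, j) to b_p.

    b_p is the projected box in center format (x, y, w, h) on the feature map.
    Assumption: i is the row (vertical) index and j the column (horizontal) index.
    """
    x, y, w, h = b_p
    top, bottom = y - h / 2, y + h / 2
    left, right = x - w / 2, x + w / 2
    d = (i - top, j - left, bottom - i, right - j)
    return tuple(v / S for v in d)


# Pixel at the center of a hypothetical 20x10 projected box centered at (16, 8).
print(offset_target(8, 16, (16.0, 8.0, 20.0, 10.0)))
```

Because targets are only defined inside the effective box, which lies within the projected box, all four distances are non-negative. This matches the ReLU placed after the regression layer.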

During inference, it is straightforward to decode the predicted boxes from the classification and regression outputs. At each pixel location $(i,j)$, suppose the predicted offsets are $[\hat{o}_{t_{i,j}},\hat{o}_{l_{i,j}},\hat{o}_{b_{i,j}},\hat{o}_{r_{i,j}}]$. Then the predicted distances are $[S\hat{o}_{t_{i,j}},S\hat{o}_{l_{i,j}},S\hat{o}_{b_{i,j}},S\hat{o}_{r_{i,j}}]$, and the top-left corner and the bottom-right corner of the predicted projected box are $(i-S\hat{o}_{t_{i,j}},j-S\hat{o}_{l_{i,j}})$ and $(i+S\hat{o}_{b_{i,j}},j+S\hat{o}_{r_{i,j}})$ respectively. We further scale up the projected box by $2^{l}$ to get the final box in the image plane. The confidence score and class for the box are given by the maximum score and the corresponding class of the $K$-dimensional vector at location $(i,j)$ on the classification output maps.
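The decoding step can be sketched in a few lines of Python. The function name `decode_box` and the example values are hypothetical; the arithmetic follows the formulas above, with the corners returned in the same (vertical, horizontal) order as the text.

```python
def decode_box(i, j, offsets, level, S=4.0):
    """Decode predicted offsets at pixel (i, j) into an image-plane box.

    Returns (top, left, bottom, right) after scaling by 2**level.
    """
    o_t, o_l, o_b, o_r = offsets
    top, left = i - S * o_t, j - S * o_l
    bottom, right = i + S * o_b, j + S * o_r
    scale = 2 ** level
    return (top * scale, left * scale, bottom * scale, right * scale)


# Hypothetical prediction on P_3 at pixel (8, 16).
print(decode_box(8, 16, (1.25, 2.5, 1.25, 2.5), level=3))
```

With these example offsets the decoded box spans rows 24 to 104 and columns 48 to 208, i.e. a 160x80 box centered at (128, 64) in the image plane.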

3.3 Online Feature Selection

The design of the anchor-free branches allows us to learn each instance using the feature of an arbitrary pyramid level $P_{l}$ . To find the optimal feature level, our FSAF module selects the best $P_{l}$ based on the instance content, instead of the size of instance box as in anchor-based methods.

Given an instance $I$ , we define its classification loss and box regression loss on $P_{l}$ as $L_{FL}^{I}(l)$ and $L_{IoU}^{I}(l)$ , respectively. They are computed by averaging the focal loss and the IoU loss over the effective box region $b_{e}^{l}$ , i.e.

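$$L_{FL}^{I}(l)=\frac{1}{N(b_{e}^{l})}\sum_{i,j\in b_{e}^{l}}FL(l,i,j),\qquad L_{IoU}^{I}(l)=\frac{1}{N(b_{e}^{l})}\sum_{i,j\in b_{e}^{l}}IoU(l,i,j)\tag{1}$$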

where $N(b_{e}^{l})$ is the number of pixels inside $b_{e}^{l}$ region, and $FL(l,i,j)$ , $IoU(l,i,j)$ are the focal loss [22] and IoU loss [36] at location $(i,j)$ on $P_{l}$ respectively.

Figure 6 shows our online feature selection process. First the instance $I$ is forwarded through all levels of feature pyramid. Then the summation of $L_{FL}^{I}(l)$ and $L_{IoU}^{I}(l)$ is computed in all anchor-free branches using Eqn. (1). Finally, the best pyramid level $P_{l^{*}}$ yielding the minimal summation of losses is selected to learn the instance, i.e.

$$l^{*}=\arg\min_{l}\;L_{FL}^{I}(l)+L_{IoU}^{I}(l)\tag{2}$$

For a training batch, features are updated for their correspondingly assigned instances. The intuition is that the selected feature is currently the best to model the instance. Its loss forms a lower bound in the feature space. And by training, we further pull down this lower bound. At the time of inference, we do not need to select the feature because the most suitable level of feature pyramid will naturally output high confidence scores.
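The selection rule of Eqn. (2) amounts to an argmin over per-level averaged losses. The sketch below assumes the per-pixel focal and IoU losses inside each effective box have already been computed; the function names and the loss values are made up purely for illustration.

```python
def mean(values):
    """Average over the pixels of an effective box (must be non-empty)."""
    return sum(values) / len(values)


def select_level(focal_by_level, iou_by_level):
    """Return the level minimizing mean focal loss + mean IoU loss.

    Both arguments map a pyramid level to the list of per-pixel losses
    inside the instance's effective box on that level.
    """
    totals = {
        level: mean(focal_by_level[level]) + mean(iou_by_level[level])
        for level in focal_by_level
    }
    return min(totals, key=totals.get)


# Illustrative per-pixel losses for one instance on P_3 .. P_7 (not real data).
focal = {3: [0.9, 0.8], 4: [0.4, 0.6], 5: [0.2, 0.3], 6: [0.5], 7: [0.7]}
iou = {3: [0.7, 0.9], 4: [0.5, 0.5], 5: [0.3, 0.2], 6: [0.6], 7: [0.9]}
print(select_level(focal, iou))
```

Only the selected level would then receive positive supervision for this instance in the current training step.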

In order to verify the importance of our online feature selection, we also conduct a heuristic feature selection process for comparison in the ablation studies (4.1). The heuristic feature selection depends purely on box sizes. We borrow the idea from the FPN detector [21]. An instance $I$ is assigned to the level $P_{l^{\prime}}$ of the feature pyramid by:

$$l^{\prime}=\lfloor l_{0}+\log_{2}(\sqrt{wh}/224)\rfloor\tag{3}$$

Here 224 is the canonical ImageNet pre-training size, and $l_{0}$ is the target level onto which an instance with $w\times h=224^{2}$ should be mapped. In this work we choose $l_{0}=5$ because ResNet [13] uses the feature map from the 5th convolution group to do the final classification.
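For comparison, the size-based heuristic can be sketched as below, assuming the FPN-style rule $l^{\prime}=\lfloor l_{0}+\log_{2}(\sqrt{wh}/224)\rfloor$, which maps a $224^{2}$ instance to $l_{0}$ as the text requires. Clamping the result to the available levels $P_{3}$ through $P_{7}$ is an assumption added here so the function always returns a valid level; the text does not describe it.

```python
import math


def heuristic_level(w, h, l0=5, l_min=3, l_max=7):
    """Size-based level assignment in the style of FPN.

    The clamp to [l_min, l_max] is an assumption, not stated in the text.
    """
    level = math.floor(l0 + math.log2(math.sqrt(w * h) / 224))
    return max(l_min, min(l_max, level))


for size in (112, 224, 448):
    print(size, heuristic_level(size, size))
```

The rule depends only on the box area, so two instances of equal area but very different content always land on the same level. That is the behaviour online feature selection is meant to relax.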

3.4 Joint Inference and Training

When plugged into RetinaNet [22], our FSAF module works jointly with the anchor-based branches, see Figure 4. We keep the anchor-based branches as original, with all hyperparameters unchanged in both training and inference.

Inference: The FSAF module just adds a few convolution layers to the fully-convolutional RetinaNet, so the inference is still as simple as forwarding an image through the network. For anchor-free branches, we only decode box predictions from at most 1k top-scoring locations in each pyramid level, after thresholding the confidence scores by 0.05. These top predictions from all levels are merged with the box predictions from anchor-based branches, followed by non-maximum suppression with a threshold of 0.5, yielding the final detections.
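The post-processing described above can be sketched as follows. This is an illustration under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, the score comparison is strict, and the NMS shown is a plain greedy, class-agnostic variant (a real detector would typically apply it per class). All function names and detections are hypothetical.

```python
def top_candidates(dets, score_thresh=0.05, max_per_level=1000):
    """Keep at most max_per_level detections of one level above the threshold."""
    kept = [d for d in dets if d[0] > score_thresh]
    kept.sort(key=lambda d: d[0], reverse=True)
    return kept[:max_per_level]


def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(dets, iou_thresh=0.5):
    """Greedy NMS over (score, box) pairs, highest score first."""
    kept = []
    for score, box in sorted(dets, key=lambda d: d[0], reverse=True):
        if all(iou(box, kept_box) <= iou_thresh for _, kept_box in kept):
            kept.append((score, box))
    return kept


# Hypothetical detections from one anchor-free level and the anchor-based branch.
anchor_free = [(0.9, (0, 0, 10, 10)), (0.03, (50, 50, 60, 60)),
               (0.6, (100, 100, 120, 120))]
anchor_based = [(0.8, (1, 1, 11, 11)), (0.7, (200, 200, 220, 220))]
final = nms(top_candidates(anchor_free) + anchor_based)
print([score for score, _ in final])
```

In this toy example the low-scoring anchor-free box is filtered by the 0.05 threshold, and the anchor-based box that heavily overlaps the top anchor-free box is suppressed.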

Initialization: The backbone networks are pre-trained on ImageNet1k [5]. We initialize the layers in RetinaNet as in [22]. For conv layers in our FSAF module, we initialize the classification layers with bias $-\log((1-\pi)/\pi)$ and a Gaussian weight filled with $\sigma=0.01$ , where $\pi$ specifies that at the beginning of training every pixel location outputs objectness scores around $\pi$ . We set $\pi=0.01$ following [22]. All the box regression layers are initialized with bias $b$ , and a Gaussian weight filled with $\sigma=0.01$ . We use $b=0.1$ in all experiments. The initialization helps stabilize the network learning in the early iterations by preventing large losses.
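To see what the classification bias does, the short sketch below evaluates $-\log((1-\pi)/\pi)$ for $\pi=0.01$ and checks that a sigmoid applied to it returns $\pi$. The helper names are hypothetical, and the check assumes the weighted input is near zero at the start of training, which the small Gaussian weights make plausible.

```python
import math


def prior_bias(pi=0.01):
    """Bias b such that sigmoid(b) == pi."""
    return -math.log((1 - pi) / pi)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


bias = prior_bias(0.01)
print(bias)           # roughly -4.6
print(sigmoid(bias))  # roughly 0.01, the initial objectness score
```

Starting every location at a score near 0.01 keeps the focal loss from being dominated by the large number of background pixels in the first iterations.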

Optimization: The loss for the whole network combines the losses from the anchor-free and anchor-based branches. Let $L^{ab}$ be the total loss of the original anchor-based RetinaNet, and let $L_{cls}^{af}$ and $L_{reg}^{af}$ be the total classification and regression losses of anchor-free branches, respectively. Then the total optimization loss is $L=L^{ab}+\lambda(L_{cls}^{af}+L_{reg}^{af})$, where $\lambda$ controls the weight of the anchor-free branches. We set $\lambda=0.5$ in all experiments, although results are robust to the exact value. The entire network is trained with stochastic gradient descent (SGD) on 8 GPUs with 2 images per GPU. Unless otherwise noted, all models are trained for 90k iterations with an initial learning rate of 0.01, which is divided by 10 at 60k and again at 80k iterations. Horizontal image flipping is the only data augmentation applied unless otherwise specified. Weight decay is 0.0001 and momentum is 0.9.
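The loss combination and the step schedule can be written out as a small sketch. The function names are hypothetical, the loss values are placeholders, and two details are assumptions: the learning rate drops exactly at iterations 60k and 80k (inclusive), and no warm-up phase is modeled.

```python
def total_loss(l_ab, l_af_cls, l_af_reg, lam=0.5):
    """L = L^ab + lambda * (L^af_cls + L^af_reg)."""
    return l_ab + lam * (l_af_cls + l_af_reg)


def learning_rate(iteration, base_lr=0.01, steps=(60_000, 80_000), gamma=0.1):
    """Step schedule: multiply by gamma at each step boundary already reached."""
    num_drops = sum(iteration >= step for step in steps)
    return base_lr * gamma ** num_drops


print(total_loss(1.0, 0.4, 0.6))  # placeholder loss values
for it in (0, 60_000, 80_000):
    print(it, learning_rate(it))

# Effective batch size implied by the text: 8 GPUs x 2 images per GPU.
print(8 * 2)
```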

4 Experiments

We conduct experiments on the detection track of the COCO dataset [23]. The training data is the COCO trainval35k split, including all 80k images from train and a random 35k subset of images from the 40k val split. We analyze our method by ablation studies on the minival split containing the remaining 5k images from val. When comparing to the state-of-the-art methods, we report COCO AP on the test-dev split, which has no public labels and requires the use of the evaluation server.

4.1 Ablation Studies

For all ablation studies, we use an image scale of 800 pixels for both training and testing. We evaluate the contribution of several important elements to our detector, including anchor-free branches, online feature selection, and backbone networks. Results are reported in Tables 1 and 2.

Anchor-free branches are necessary. We first train two detectors with only anchor-free branches, one for each of the two feature selection methods (Table 1, 2nd and 3rd entries). It turns out that anchor-free branches alone already achieve decent results. When jointly optimized with anchor-based branches, anchor-free branches help learn instances that are hard to model with anchor-based branches, leading to improved AP scores (Table 1, 5th entry). In particular, the AP₅₀, AP_S, and AP_L scores increase over RetinaNet by 2.5%, 1.5%, and 2.2% respectively with online feature selection. To find out what kinds of objects the FSAF module can detect, we show some qualitative results of the head-to-head comparison between RetinaNet and ours in Figure 7. Clearly, our FSAF module is better at finding challenging instances, such as tiny and very thin objects which are not well covered by anchor boxes.

Online feature selection is essential. As stated in Section 3.3, we can select features in anchor-free branches either based on heuristics, just like the anchor-based branches, or based on instance content. It turns out that selecting the right feature to learn plays a fundamental role in detection. Experiments show that anchor-free branches with heuristic feature selection (Eqn. (3)) alone cannot compete with anchor-based counterparts because they have fewer learnable parameters. But with our online feature selection (Eqn. (2)), the AP is improved by 1.2% (Table 1, 3rd vs. 2nd entries), which overcomes the parameter disadvantage. Additionally, the 4th and 5th entries of Table 1 further confirm that our online feature selection is essential for anchor-free and anchor-based branches to work well together.

How is the optimal feature selected? In order to understand the optimal pyramid level selected for instances, we visualize some qualitative detection results from only the anchor-free branches in Figure 8. The number before the class name indicates the feature level that detects the object. It turns out that the online feature selection largely follows the rule that upper levels select larger instances and lower levels are responsible for smaller instances, which is the same principle as in anchor-based branches. However, there are quite a few exceptions, i.e. online feature selection chooses pyramid levels different from the choices of anchor-based branches. We label these exceptions as red boxes in Figure 8. Green boxes indicate agreement between the FSAF module and anchor-based branches. By capturing these exceptions, our FSAF module can use better features to detect challenging objects.

FSAF module is robust and efficient. We also evaluate the effect of backbone networks on our FSAF module in terms of accuracy and speed. The three backbone networks are ResNet-50, ResNet-101 [13], and ResNeXt-101 [34]. Detectors run on a single Titan X GPU with CUDA 9 and CUDNN 7 using a batch size of 1. Results are reported in Table 2. We find that our FSAF module is robust to various backbone networks. The FSAF module by itself is already better and faster than anchor-based RetinaNet. On ResNeXt-101, the FSAF module outperforms anchor-based counterparts by 1.2% AP while being 68ms faster. When applied jointly with anchor-based branches, our FSAF module consistently offers considerable improvements. This also suggests that anchor-based branches are not utilizing the full power of backbone networks. Meanwhile, our FSAF module introduces marginal computation cost to the whole network, leading to negligible loss of inference speed. In particular, we improve RetinaNet by 1.8% AP on ResNeXt-101 with only 6ms additional inference latency.
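The margins quoted in this paragraph can be recomputed directly from Table 2. The sketch below copies the AP and runtime values from that table; the dictionary layout and function name are only an illustration.

```python
# (AP, runtime in ms/im) per backbone and method, copied from Table 2.
TABLE2 = {
    "R-50": {"RetinaNet": (35.7, 131), "FSAF": (35.9, 107), "AB+FSAF": (37.2, 138)},
    "R-101": {"RetinaNet": (37.7, 172), "FSAF": (37.9, 148), "AB+FSAF": (39.3, 180)},
    "X-101": {"RetinaNet": (39.8, 356), "FSAF": (41.0, 288), "AB+FSAF": (41.6, 362)},
}


def delta_vs_retinanet(backbone, method):
    """Return (AP gain, runtime change in ms) relative to RetinaNet."""
    base_ap, base_ms = TABLE2[backbone]["RetinaNet"]
    ap, ms = TABLE2[backbone][method]
    return round(ap - base_ap, 1), ms - base_ms


for backbone in TABLE2:
    for method in ("FSAF", "AB+FSAF"):
        print(backbone, method, delta_vs_retinanet(backbone, method))
```

On every backbone the anchor-free-only variant is faster than RetinaNet, while the joint variant adds 6 to 8 ms per image.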

4.2 Comparison to State of the Art

We evaluate our final detector on the COCO test-dev split to compare with recent state-of-the-art methods. Our final model is RetinaNet with the FSAF module, i.e. anchor-based branches plus the FSAF module. The model is trained using scale jitter over scales {640, 672, 704, 736, 768, 800} and for $1.5\times$ longer than the models in Section 4.1. The evaluation includes single-scale and multi-scale versions, where single-scale testing uses an image scale of 800 pixels and multi-scale testing applies test time augmentations. Test time augmentations are testing over scales {400, 500, 600, 700, 900, 1000, 1100, 1200} and horizontal flipping on each scale, following Detectron [10]. All of our results are from single models without ensemble.

Table 3 presents the comparison. With ResNet-101, our detector is able to achieve competitive performance in both single-scale and multi-scale scenarios. Plugging in ResNeXt-101-64x4d further improves AP to 44.6%, which outperforms previous state-of-the-art single-shot detectors by a large margin.

5 Conclusion

This work identifies heuristic feature selection as the primary limitation for anchor-based single-shot detectors with feature pyramids. To address this, we propose the FSAF module, which applies online feature selection to train anchor-free branches in the feature pyramid. It significantly improves strong baselines with tiny inference overhead and outperforms recent state-of-the-art single-shot detectors.

Bibliography

[1] C. Bhagavatula, C. Zhu, K. Luu, and M. Savvides. Faster than real-time facial alignment: A 3d spatial transformer network approach in unconstrained poses. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[2] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis. Soft-nms—improving object detection with one line of code. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 5562–5570. IEEE, 2017.
[3] Z. Cai and N. Vasconcelos. Cascade r-cnn: Delving into high quality object detection. arXiv preprint arXiv:1712.00726, 2017.
[4] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 764–773. IEEE, 2017.
[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
[6] P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: A benchmark. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 304–311. IEEE, 2009.
[7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
[8] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. Dssd: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017.