MSFD: Multi-Scale Receptive Field Face Detector

Qiushan Guo; Yuan Dong; Yu Guo; Hongliang Bai

arXiv:1903.04147·cs.CV·March 12, 2019


TL;DR

MSFD is a real-time face detector leveraging multi-scale receptive fields, hierarchical context, and an anchor strategy, achieving high accuracy across various datasets with efficient inference speed.

Contribution

The paper introduces MSFD, a novel face detection method that combines multi-scale receptive fields, a new anchor assignment strategy, and focal loss for improved accuracy and speed.

Findings

01

Achieves superior detection performance on FDDB, Pascal-Faces, and WIDER FACE datasets.

02

Runs at 31 FPS on GPU for VGA images.

03

Effectively detects faces of various scales, including small and rotated faces.


Tables

TABLE I: Ablation studies of MSFD. All settings are trained on the WIDER FACE training set and tested on the validation set. MSFD(C) adopts the same implementation details as S³FD(F) except for the Context-Texture module. MSFD(C+F) is additionally trained with focal loss. MSFD is the complete model with the anchor assignment strategy. A ✓ marks a component that is used; "-" marks one that is not.

Method	S³FD(F)	MSFD(C)	MSFD(C+F)	S³FD(F+S+M)	MSFD
Scale compensation + Max-out	-	-	-	✓	-
Context-Texture Module	-	✓	✓	-	✓
Focal Loss	-	-	✓	-	✓
Anchor Assignment	-	-	-	-	✓
Easy (mAP %)	92.6	93.8	94.6	93.7	94.7
Medium (mAP %)	91.6	92.7	93.5	92.5	93.5
Hard (mAP %)	82.3	83.8	85.0	85.2	86.5

Equations

$$f^{\prime}_{n} = \mathcal{C}(f_{n-1}, f_{n}, f_{n+1}) \quad (1)$$

$$L(\{p_{i}, t_{i}\}) = \frac{\lambda}{N_{cls}} \sum_{i} L_{cls}(p_{i}, p^{*}_{i}) + \frac{1}{N_{reg}} \sum_{i} I(p^{*}_{i}=1)\, L_{reg}(t_{i}, t^{*}_{i}) \quad (2)$$


Taxonomy

Topics: Face recognition and analysis · Face and Expression Recognition · Biometric Identification and Security

Full text

Qiushan Guo
Beijing University of Posts and Telecommunications, Beijing, China
Email: [email protected]

Yuan Dong and Yu Guo
Beijing University of Posts and Telecommunications, Beijing, China
Email: {yuandong, guoyu24k}@bupt.edu.cn

Hongliang Bai
Beijing Faceall Technology Co., Ltd, Beijing, China
Email: [email protected]

Abstract

We study the multi-scale receptive fields of a single convolutional neural network for detecting faces of varied scales. This paper presents our Multi-Scale Receptive Field Face Detector (MSFD), which achieves superior performance on faces at different scales while running at real-time inference speed. MSFD agglomerates context and texture through a hierarchical structure. The additional information and richer receptive fields bring significant improvement at only marginal extra time cost. We also propose an anchor assignment strategy that covers faces over a wide range of scales to improve the recall rate of small and rotated faces. To reduce the false positive rate, we train our detector with focal loss, which keeps easy samples from overwhelming the loss. As a result, MSFD reaches superior results on the FDDB, Pascal-Faces and WIDER FACE datasets, and runs at 31 FPS on GPU for VGA-resolution images.

I Introduction

As one of the fundamental problems in computer vision and pattern recognition, face detection is a key step in tasks such as face alignment [37], face recognition [23] and expression analysis. Early face detectors, such as Viola-Jones [24], are based on hand-crafted features. Such detectors fail to handle the complexities of practical applications, including varied face scales, illumination conditions, poses and facial expressions.

In recent years, CNN-based methods have made great progress in classification and detection tasks. Given the excellent representation power of CNNs, many elegant object detectors have appeared. In general, face detection can be viewed as a special case of generic object detection. Many of the best-performing face detectors follow the ideas of anchor-based detection methods such as RCNN [2] and SSD [13]. These methods regress a set of anchors with preset shapes towards objects and classify them. However, some changes are needed to handle face detection. We illustrate this with three typical aspects.

I-A Structure

RCNN-based methods place anchors on the last layer of the CNN. But as objects become small, the recall rate drops dramatically, and the same holds for small faces. Higher-level features with larger receptive fields carry abstract semantic information and tend to ignore small objects. To get better results, low-level features containing rich detail together with a suitable receptive field are needed. Therefore, we build our face detection framework on a feature pyramid.

I-B Feature fusion strategy

The Single Shot Detector (SSD) [13] is one of the first attempts at using a ConvNet’s pyramidal feature hierarchy. Later, FPN [11] introduces top-down information transmission, making it possible for abstract semantic information to reach low-level features, thus providing context for the detection of small objects. Our feature fusion exploits the context relationship between features from layers at different levels. Feature fusion also changes the receptive field, so we only fuse adjacent features to avoid overly large receptive fields, which contribute little to the detection of small faces.

I-C Class imbalance

To improve the recall rate of faces, anchors should be dense or the matching threshold should be loose. However, densely assigning anchors as in S3FD [34] produces a very large number of negative examples, leading to an extremely imbalanced sample distribution. A common solution is some form of hard negative mining, which selects hard negatives to feed into training, or a more complex sampling/reweighting scheme. In this work, we use focal loss [12] to handle the class imbalance and train efficiently without easy negatives overwhelming the loss.

The main contributions of this paper can be summarized as:

• We propose a novel feature agglomeration framework for face detection, which gathers more context and texture from adjacent feature maps to increase the recall rate.

• We introduce an anchor assignment strategy to improve the recall rate of rotated faces and outer faces.

• We adopt focal loss to deal with the imbalance between face and background and to reduce the high false positive rate on small faces. We also show that focal loss can improve the recall rate of faces.

• We achieve superior results on PASCAL-Faces, FDDB and WIDER FACE at a lower computational cost than state-of-the-art methods.

II Related Work

As a fundamental problem in computer vision, face detection has been extensively studied in recent years. The Viola-Jones detection framework [24] is a groundbreaking work that uses Haar features and AdaBoost to train a cascade classifier, achieving fairly good results. Since then, researchers have focused on designing more powerful hand-crafted features [8, 10, 28, 1, 36, 15] to improve performance. However, these traditional approaches rely heavily on the effectiveness of hand-crafted features and optimize each component separately, which leaves the whole pipeline sub-optimal.

In recent years, as deep learning techniques, especially convolutional neural networks (CNNs), have gained popularity and produced remarkable results on numerous computer vision tasks, CNN-based face detectors have become the mainstream. Among these, CascadeCNN [9] and MTCNN [31] both train a cascade structure for detection, while the latter uses multi-task CNNs to solve detection and alignment jointly. Yang et al. [29] train multiple CNNs for facial attributes to enhance the detection of occluded faces.

Naturally, face detection can be regarded as a special case of generic object detection, whose frameworks can also be transferred to the face detection task. Faster R-CNN [20] is one of the state-of-the-art detection pipelines and is composed of two stages. Based on it, Jiang et al. [7] build a face detector with fairly good performance. Wan et al. [25] and Sun et al. [22] both add effective strategies, including hard example mining and feature fusion, to Faster R-CNN in order to achieve better results, while CMS-RCNN [35] also attaches body contextual information. What’s more, Wang et al. [26] adopt another two-stage framework, R-FCN, to build their detector and achieve state-of-the-art results on the FDDB dataset [5].

Single-stage detectors, including SSD [13] and YOLO [19], are another popular form of detection pipeline, performing classification and regression simultaneously. SSH [16] is a typical single-stage detector with context modules. Inspired by SSD and RPN [20], Zhang et al. propose S3FD [34], with an anchor matching strategy and a max-out background label, to reach state-of-the-art performance on WIDER FACE [30] at real-time speed. In this work, we develop a superior face detector with real-time speed which adopts a pyramidal feature hierarchy and aggregates multi-scale features with the Context-Texture module.

III Method

III-A General Architecture

Our goal is to leverage a ConvNet’s pyramidal feature hierarchy, which has more semantics at high levels and more texture at low levels. To this end, we propose a novel hierarchical feature agglomeration structure which aggregates adjacent features to increase recall rate.

III-A1 Constructing architecture

The general architecture is shown in Fig. 1. We inherit the backbone from [34] (based on the VGG16 network) and extract features from conv3_3, conv4_3, conv5_3, conv_fc7, conv6_2 and conv7_2. These layers have strides of {4, 8, 16, 32, 64, 128} pixels with respect to the input image. Deeper layers have larger receptive fields, so the set of layers together can detect faces of different sizes. Most faces in pictures from the Internet can be detected with the help of these flexible receptive fields. As features from conv3_3, conv4_3 and conv5_3 have different scales from those of the other layers, we apply L2 normalization [14] to rescale them. We then concatenate them with other features after transforming the feature maps, which have different shapes.
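As a quick illustration of the strides listed above, the following sketch computes the spatial size of each detection layer for a square input. It assumes each feature map side equals the input side divided by the stride (i.e. padding keeps the division exact); the dictionary and function names are hypothetical and not taken from the paper's code.

```python
# Strides of the six detection layers, as listed in the text.
LAYER_STRIDES = {
    "conv3_3": 4,
    "conv4_3": 8,
    "conv5_3": 16,
    "conv_fc7": 32,
    "conv6_2": 64,
    "conv7_2": 128,
}


def feature_map_sizes(input_size):
    """Return {layer: side of the square feature map}.

    Assumption: side = input_size // stride, i.e. no rounding effects
    from padding or pooling modes are modelled here.
    """
    return {name: input_size // stride for name, stride in LAYER_STRIDES.items()}


sizes = feature_map_sizes(640)  # 640x640 is the training resolution
for name, side in sizes.items():
    print(f"{name}: {side}x{side}")
```

Under this assumption, a 640×640 training image yields maps from 160×160 (conv3_3) down to 5×5 (conv7_2).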

III-A2 Context-Texture module

To get more context and texture, we apply a Context-Texture block to adjacent selected feature maps with different shapes. The feature of the n-th selected layer is denoted as $f_{n}$ and the merged feature is denoted as $f^{\prime}_{n}$ . The merging process can be expressed as follows:

$$f^{\prime}_{n} = \mathcal{C}(f_{n-1}, f_{n}, f_{n+1}) \quad (1)$$

where $\mathcal{C}$ is the Context-Texture module in Fig. 2. The function’s input doesn’t contain $f_{n-1}$ when $n=1$ . The same goes for $f_{n+1}$ when $n+1$ is greater than the total number of selected feature maps. Each Context-Texture block agglomerates the shallower feature $f_{n-1}$ and the deeper feature $f_{n+1}$ to enrich the receptive field.

However, additional information is helpful but should not be overwhelming. Supposing that the channels of $f_{n}$ are reduced to $N$ (e.g. 256) by $1\times 1$ convolutional filters, we reduce the channels of $f_{n-1}$ and $f_{n+1}$ to $\frac{N}{8}$ in the same way. After that, we bring the feature maps from different levels to the same 2-D shape by applying bilinear interpolation to the deeper one and max-pooling with stride = 2 to the shallower one. The final agglomerated feature $f^{\prime}_{n}$ is obtained by concatenating these three features. Considering that too large or too small receptive fields can degrade detection performance [4] and increase time and memory consumption, we only take adjacent features for agglomeration.
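The shape bookkeeping of the Context-Texture merge can be sketched as follows. This is only an illustration of the channel and spatial arithmetic described above, not the authors' implementation: the function names are hypothetical, layers are 1-indexed, and the upsampling factor of 2 is an assumption that follows from adjacent layers differing in stride by a factor of 2.

```python
def context_texture_channels(n, num_layers, N=256):
    """Channel count of the merged feature f'_n (n is 1-indexed)."""
    channels = N                  # f_n reduced to N channels by 1x1 conv
    if n > 1:
        channels += N // 8        # shallower neighbour f_{n-1}
    if n < num_layers:
        channels += N // 8        # deeper neighbour f_{n+1}
    return channels


def aligned_sides(sides, n):
    """Spatial side of each input after resizing to the shape of f_n.

    `sides` is a 0-indexed list of feature map sides, shallow to deep.
    """
    out = {"f_n": sides[n - 1]}
    if n > 1:
        out["f_n-1"] = sides[n - 2] // 2   # max-pooling with stride 2
    if n < len(sides):
        out["f_n+1"] = sides[n] * 2        # bilinear upsampling, assumed x2
    return out


sides = [160, 80, 40, 20, 10, 5]           # 640x640 input, strides 4..128
print(context_texture_channels(3, len(sides)))   # interior layer
print(context_texture_channels(1, len(sides)))   # first layer, no f_0
print(aligned_sides(sides, 3))
```

With $N=256$, an interior layer ends up with $256 + 2\times 32 = 320$ channels, while the first and last layers, which have only one neighbour, end up with 288.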

III-A3 Classification Subnet and Box Regression Subnet

Two subnetworks are attached to each agglomerated feature map, one for classifying the anchor boxes and the other for regressing from anchor boxes to ground-truth boxes. The classification subnet applies two $3\times 3$ conv layers, each with 128 filters and followed by ReLU activations. Their output is then fed to a $3\times 3$ conv layer with $K\times A$ filters, where $K$ is the number of classes and $A$ is the number of anchors per location. The first two $3\times 3$ conv layers extract semantic information from the agglomerated feature map so that anchors can be classified accurately. In parallel with the classification subnet, the box regression subnet has the same structure, except that its last conv layer outputs $4\times{A}$ relative offsets between the anchor and the ground-truth box.

III-B Anchor Assignment Strategy

During training, we need to determine which anchor corresponds to which face bounding box. An anchor box can match faces of similar size. However, face sizes range widely. As shown in Fig. 3, most faces have an object size from 4 to 362 pixels. So, we define the anchors to have areas of { $16^{2}$ , $32^{2}$ , $64^{2}$ , $128^{2}$ , $256^{2}$ , $512^{2}$ } on these agglomerated feature maps. Each scale corresponds to a feature layer. Faces whose scale lies far from the anchor scales, such as the tiny and outer faces in Fig. 4, cannot match enough anchors, which leads to a low recall rate. Increasing the density of anchors and reducing the difficulty of matching can both contribute to a higher recall rate. To increase the density of anchors, we set the anchors’ aspect ratios to 1 and 1.5, to account for the varying aspect ratios of faces. An anchor is assigned to the ground-truth box with which it has the highest IoU if that IoU is larger than 0.5, and to background if its highest IoU is less than 0.4. Unassigned anchors are ignored during training.
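The matching rule above (positive above an IoU of 0.5, background below 0.4, ignored in between) can be sketched in a few lines. This is a minimal sketch rather than the authors' code: boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the function names and the label encoding (face index, `-1`, `None`) are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def assign_anchors(anchors, gt_boxes, pos_thr=0.5, neg_thr=0.4):
    """Label each anchor: index of its matched face, -1 for background,
    or None if the anchor is ignored during training."""
    labels = []
    for anchor in anchors:
        overlaps = [iou(anchor, gt) for gt in gt_boxes]
        best = max(overlaps, default=0.0)
        if best > pos_thr:
            labels.append(overlaps.index(best))   # positive anchor
        elif best < neg_thr:
            labels.append(-1)                     # background
        else:
            labels.append(None)                   # between thresholds: ignored
    return labels


faces = [(0, 0, 16, 16)]
anchors = [(0, 0, 16, 16), (6, 0, 22, 16), (100, 100, 116, 116)]
print(assign_anchors(anchors, faces))
```

The second anchor overlaps the face with an IoU of about 0.45, so it falls between the two thresholds and contributes to neither the classification nor the regression loss.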

III-C Training

III-C1 Loss function

Face detection methods encounter a salient class imbalance during training. We apply this model to a large number of random images and find that about 99.9% of the anchors are negative samples and only a few are positive ones. To address this extreme imbalance between positive and negative samples, most face detection frameworks adopt an online hard negative mining strategy, which is helpful. However, the recently proposed focal loss [12] is more powerful. We employ a multi-task loss function to jointly optimize model parameters:

$$L(\{p_{i}, t_{i}\}) = \frac{\lambda}{N_{cls}} \sum_{i} L_{cls}(p_{i}, p^{*}_{i}) + \frac{1}{N_{reg}} \sum_{i} I(p^{*}_{i}=1)\, L_{reg}(t_{i}, t^{*}_{i}) \quad (2)$$

where $i$ is the index of an anchor and $p_{i}$ is the predicted probability that anchor $i$ is a face. The ground-truth label $p^{*}_{i}$ is 1 if the anchor is positive and 0 if it is negative. As defined in [20], $t_{i}$ is a vector representing the 4 parameterized coordinates of the predicted bounding box and $t^{*}_{i}$ is that of the ground-truth box associated with a positive anchor. The classification loss $L_{cls}(p_{i},p^{*}_{i})$ is focal loss over two classes (face and background) parameterized with $\alpha=0.25$ and $\gamma=2$ . The regression loss $L_{reg}(t_{i},t^{*}_{i})$ is the smooth L1 loss defined in [2]. $I(p^{*}_{i}=1)$ is the indicator function that restricts the regression loss to the positively assigned anchors. The two losses are balanced by $\lambda$ . $N_{cls}$ and $N_{reg}$ are both the number of positive anchors.
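The multi-task loss can be illustrated with a small pure-Python sketch. It uses the standard binary focal loss, $-\alpha_{t}(1-p_{t})^{\gamma}\log(p_{t})$, and the standard smooth L1 loss; the function names, the clamping constant, and the scalar-per-anchor data layout are assumptions made for illustration, not details taken from the paper.

```python
import math


def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for one anchor; p is the predicted face probability."""
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)       # clamp to avoid log(0) (assumption)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def smooth_l1(x):
    """Smooth L1 loss on a single coordinate difference."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def multitask_loss(probs, labels, pred_offsets, gt_offsets, lam=3.0):
    """Classification plus regression loss, both normalized by the number
    of positive anchors. Regression is summed over positives only."""
    num_pos = max(1, sum(labels))         # guard against images with no faces
    cls = sum(focal_loss(p, y) for p, y in zip(probs, labels))
    reg = sum(
        smooth_l1(t - t_star)
        for y, t_vec, t_star_vec in zip(labels, pred_offsets, gt_offsets)
        if y == 1
        for t, t_star in zip(t_vec, t_star_vec)
    )
    return lam * cls / num_pos + reg / num_pos


# An easy negative contributes far less than a hard (misclassified) negative.
print(focal_loss(0.1, 0), focal_loss(0.9, 0))
```

The printed pair shows the down-weighting effect: a confidently rejected background anchor contributes roughly three orders of magnitude less loss than a background anchor scored as a face.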

III-C2 Training dataset and data augmentation

Our model is trained on the 12,880 images of the WIDER FACE training set. The distribution of face scale for this set is shown in Fig. 3. Faces smaller than 20 pixels affect the average precision of the hard detection task in WIDER FACE. So, we randomly crop square patches from the original image for training, with a side ranging from 0.3 to 1 times the shorter side. In addition, the overlapped part of a face box is discarded if the box center lies outside the sampled patch. After random cropping, we apply a color distortion strategy to preprocess training images, e.g. adjusting brightness, contrast and saturation. Finally, the square patch is resized to $640\times 640$ and horizontally flipped with a probability of 0.5.
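The cropping step can be sketched as below. This is one plausible reading of the description, stated as an assumption: a face is kept only if its center lies inside the sampled patch, and kept boxes are clipped to the patch and expressed in patch coordinates. The function names and the `(x0, y0, side)` crop representation are hypothetical; color distortion, resizing and flipping are omitted.

```python
import random


def sample_square_crop(width, height, rng, min_scale=0.3, max_scale=1.0):
    """Sample a square patch whose side is 0.3 to 1 times the shorter side."""
    short = min(width, height)
    side = max(1, int(round(rng.uniform(min_scale, max_scale) * short)))
    x0 = rng.randint(0, width - side)     # randint is inclusive on both ends
    y0 = rng.randint(0, height - side)
    return x0, y0, side


def keep_faces_in_crop(boxes, crop):
    """Keep faces whose center is inside the crop; clip and shift them.

    Boxes are (x1, y1, x2, y2) in image coordinates; the result is in
    crop coordinates. Clipping kept boxes is an assumption.
    """
    x0, y0, side = crop
    kept = []
    for x1, y1, x2, y2 in boxes:
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        if x0 <= cx < x0 + side and y0 <= cy < y0 + side:
            kept.append((
                max(x1, x0) - x0,
                max(y1, y0) - y0,
                min(x2, x0 + side) - x0,
                min(y2, y0 + side) - y0,
            ))
    return kept


rng = random.Random(0)
crop = sample_square_crop(1024, 768, rng)
print(crop)
print(keep_faces_in_crop([(150, 150, 190, 190), (280, 280, 340, 340)], (100, 100, 200)))
```

In the second call, the first face lies fully inside the patch and is simply shifted, while the second face has its center outside the patch and is dropped.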

III-C3 Other implementation details

We use the pre-trained VGG16 [21] as the backbone. The parameters of conv_fc6 and conv_fc7 are initialized by subsampling parameters from fc6 and fc7 of VGG16, and the other additional layers (conv6_1, conv6_2, conv7_1, conv7_2) are randomly initialized with the “Xavier” method [3]. We initialize all convolutional layers of the classification subnet except the final one with bias $b=0$ and a Gaussian weight filler with $\sigma=0.01$ . The final convolutional layer is initialized with bias $b=-\log((1-\pi)/\pi)$ , with $\pi=0.01$ here. The $\lambda$ in Eq. (2) is set to 3 to balance the classification and regression losses. We use the SGD optimizer with a momentum of 0.9, a weight decay of 0.0005, and a total batch size of 16 on 4 GPUs. The maximum number of iterations is 120k; the learning rate starts at $10^{-3}$ and is divided by 10 at 80k and 100k iterations. Our implementation is based on Caffe [6].
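Two of the numeric choices above are easy to check by hand. The sketch below evaluates the prior bias $b=-\log((1-\pi)/\pi)$ and the step learning-rate schedule. It assumes a sigmoid output for the face score, as in the focal loss paper [12], so that the initial predicted face probability equals $\pi$; whether the learning rate drops exactly at or just after the listed iterations is also an assumption. All function names are hypothetical.

```python
import math


def prior_bias(pi=0.01):
    """Bias of the final classification layer for a prior face probability pi."""
    return -math.log((1.0 - pi) / pi)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def learning_rate(iteration, base_lr=1e-3, steps=(80_000, 100_000), gamma=0.1):
    """Step schedule: multiply the rate by gamma at each listed iteration."""
    drops = sum(iteration >= s for s in steps)
    return base_lr * gamma ** drops


b = prior_bias(0.01)
print(round(b, 4))           # about -4.5951
print(sigmoid(b))            # initial face probability, about 0.01
for it in (0, 80_000, 100_000):
    print(it, learning_rate(it))
```

With $\pi=0.01$ every anchor starts out predicted as background with high confidence, which keeps the huge number of negatives from producing a large, destabilizing loss in the first iterations.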

IV Experiments

IV-A Model analysis

We analyze our model on the WIDER FACE validation set through extensive experiments and ablation studies. According to the difficulty of the detection task, the validation set is split into easy, medium and hard subsets. The evaluation metric is mean average precision (mAP) with an Intersection-over-Union (IoU) threshold of 0.5.

IV-A1 Baseline

To evaluate our contributions, we adopt the closely related detector S3FD as the baseline, because both our method and S3FD are based on the same backbone. We compare our method with S3FD under two different settings: (i) $S^{3}FD(F)$ : it only uses the scale-equitable framework. (ii) $S^{3}FD(F+S+M)$ : it is the complete model. Here, $S$ is the scale compensation anchor matching strategy, which makes faces match enough anchors, and $M$ is the max-out background label, which addresses the unbalanced binary classification problem. For optimization and training, S3FD adopts online hard negative mining with softmax loss.

From the results listed in TABLE I, the following conclusions can be drawn.

IV-A2 Context-Texture module can help us judge more accurately

The Context-Texture module agglomerates adjacent features in order to enrich the receptive fields. To isolate the impact of feature agglomeration, MSFD(C) adopts the same anchor assignment strategy, training parameters and implementation details as $S^{3}FD(F)$ . TABLE I shows a significant improvement: mAP rises from 92.6% to 93.8% on the easy subset, from 91.6% to 92.7% on the medium subset, and from 82.3% to 83.8% on the hard subset.

IV-A3 Focal loss is elegant for handling the class imbalance

The MSFD(C) and MSFD(C+F) columns in TABLE I show that focal loss can effectively address the sample imbalance between face and background. The mAP on the easy, medium and hard subsets increases by 0.8, 0.8 and 1.2 percentage points, respectively. The gains mainly come from preventing the vast number of easy negatives from overwhelming the detector during training.

IV-A4 Anchor assignment strategy can match more

The results show that our anchor assignment strategy lets faces in the hard subset match enough anchors, so that the mAP on the hard subset improves by 1.5 percentage points (from 85.0% to 86.5%). This strategy makes the rotated and tiny faces, which are the main component of the hard subset, match more anchors. But increasing the density of large anchors seems less helpful on the easy and medium subsets.

IV-B Evaluation on benchmark

We evaluate our MSFD method on the common face detection benchmarks, including PASCAL-Face [27], FDDB [5] and WIDER FACE [30].

IV-B1 FDDB dataset

The dataset is a well-known benchmark and contains 5,171 faces in 2,845 images. We transform the predicted bounding boxes to ellipses to get a precise result. We adopt the annotations released by [34] to avoid false positives with high scores caused by unlabelled faces. Fig. 5(a) and Fig. 5(b) show the evaluation results. Our method achieves promising results on both discontinuous and continuous ROC curves compared with the previous state-of-the-art methods [34, 31, 4, 32, 22, 33, 17, 18]. This indicates that our MSFD can detect unconstrained faces robustly.

IV-B2 PASCAL-Face dataset

This dataset was collected from the PASCAL person layout test subset. It has 1,335 labeled faces in 851 images. Fig. 5(c) shows the precision-recall curves. Our method reaches 98.71% mAP, which outperforms the previous state-of-the-art detectors S3FD (98.49%) and SSH (98.27%) [16], and beats the other methods [18, 15, 27, 24].

IV-B3 WIDER FACE dataset

It contains 32,203 images with 393,703 annotated faces with a high degree of variability in scale, pose and occlusion. The training set has 158,989 faces, about 40% of the total. The validation set accounts for 10% and the test set for 50%. According to the difficulty of the detection task, the validation and test sets are divided into “easy”, “medium” and “hard” subsets. The difficulty is determined by the size of the face in the picture. Most faces in the “hard” subset are small. We train our detector on the training set and test on both the validation and test sets. The precision-recall curves and mAP are shown in Fig. 6. Our detector achieves 94.2% (Easy), 93.1% (Medium) and 86.3% (Hard) on the test set, and 94.7% (Easy), 93.5% (Medium) and 86.5% (Hard) on the validation set. This outperforms the overwhelming majority of the submitted results.

IV-C Inference time

Our method outputs a large number of boxes, so we first filter out boxes with low confidence using a threshold of 0.05 and keep the top 300 boxes before NMS. Then we apply NMS with a Jaccard overlap of 0.3 and keep the top 200 boxes. We measure the computation cost on an NVIDIA 1080Ti with Intel Xeon [email protected]. For VGA-resolution images, our detector runs at 31 FPS and thus achieves real-time speed. The majority of the time is spent in the VGG16 backbone network, so a lightweight backbone could be more efficient.
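The post-processing steps above can be sketched as a greedy NMS pipeline. This is a minimal pure-Python illustration, not the authors' Caffe code: boxes are assumed to be `(x1, y1, x2, y2)` tuples, the function names are hypothetical, and treating a score exactly equal to the threshold as filtered out is an assumption.

```python
def iou(box_a, box_b):
    """Jaccard overlap (IoU) of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def postprocess(boxes, scores, score_thr=0.05, pre_nms_top=300,
                nms_thr=0.3, post_nms_top=200):
    """Return indices of the detections kept after thresholding and NMS."""
    # 1. Drop low-confidence boxes, then keep the highest-scoring ones.
    candidates = [i for i, s in enumerate(scores) if s > score_thr]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    candidates = candidates[:pre_nms_top]

    # 2. Greedy NMS: keep a box only if it does not overlap a kept box
    #    by more than nms_thr.
    keep = []
    for i in candidates:
        if all(iou(boxes[i], boxes[j]) <= nms_thr for j in keep):
            keep.append(i)
            if len(keep) == post_nms_top:
                break
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60), (0, 0, 10, 10)]
scores = [0.9, 0.8, 0.7, 0.01]
print(postprocess(boxes, scores))
```

In this example the second box overlaps the first with an IoU of about 0.68 and is suppressed, and the last box is removed by the score threshold before NMS runs.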

V Conclusion

This paper proposes a novel face detector that enriches the receptive field with the Context-Texture module. The proposed method achieves superior performance on several common face detection benchmarks and runs at real-time inference speed on GPU. We analyze the relationship between the receptive field and the accuracy of the face detection module, then propose a method to agglomerate contextual and textural information through a hierarchical structure. Moreover, we propose a dense anchor assignment strategy to improve the recall rate of small faces and outer faces. We train the model robustly in an end-to-end manner with focal loss to deal with the large class imbalance between small faces and background. The experiments demonstrate the superior performance of our method. In future work, we intend to further improve the anchor assignment strategy. Generating more accurate anchors is crucial to reducing both the computational cost and the number of false positives.

VI Acknowledgment

This work is supported by the National Natural Science Foundation of China under Grant 61532018.

Bibliography

[1] Dong Chen, Shaoqing Ren, Yichen Wei, Xudong Cao, and Jian Sun. Joint cascade face detection and alignment. In European Conference on Computer Vision, pages 109–122. Springer, 2014.
[2] Ross Girshick. Fast R-CNN. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
[3] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
[4] Peiyun Hu and Deva Ramanan. Finding tiny faces. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1522–1530. IEEE, 2017.
[5] Vidit Jain and Erik Learned-Miller. FDDB: A benchmark for face detection in unconstrained settings. Technical Report UM-CS-2010-009, University of Massachusetts, Amherst, 2010.
[6] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
[7] Huaizu Jiang and Erik Learned-Miller. Face detection with the Faster R-CNN. In Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on, pages 650–657. IEEE, 2017.
[8] Haoxiang Li, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Gang Hua. Efficient boosted exemplar-based face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1843–1850, 2014.