Leveraging Orientation for Weakly Supervised Object Detection with Application to Firearm Localization

Javed Iqbal; Muhammad Akhtar Munir; Arif Mahmood; Afsheen Rafaqat Ali; Mohsen Ali

arXiv:1904.10032·cs.CV·February 2, 2021

TL;DR

This paper introduces a weakly supervised orientation-aware object detection algorithm that effectively detects firearms with oriented bounding boxes, using only axis-aligned annotations during training, and demonstrates superior performance on a new challenging dataset.

Contribution

The paper proposes a novel multistage OAOD algorithm that learns to detect oriented bounding boxes from AABB annotations, reducing the need for OBB during training.

Findings

01. OAOD outperforms state-of-the-art detectors with 88.3 mAP on AABB

02. Achieves 77.5 mAP on OBB detection

03. Introduces the ITU Firearm dataset with 11,000 annotated images

Abstract

Automatic detection of firearms is important for enhancing the security and safety of people; however, it is a challenging task owing to the wide variations in shape, size, and appearance of firearms. Also, most generic object detectors process axis-aligned rectangular areas, though a thin and long rifle may actually cover only a small percentage of that area and the rest may contain irrelevant details that suppress the required object signatures. To handle these challenges, we propose a weakly supervised Orientation Aware Object Detection (OAOD) algorithm which learns to detect oriented object bounding boxes (OBB) while using Axis-Aligned Bounding Boxes (AABB) for training. The proposed OAOD differs from existing oriented object detectors, which strictly require OBB during training, and these may not always be available. The goal of training on AABB and detection of OBB is…

Tables

Table 1: An overview of different oriented object detection methods with respect to application domain, ground truth used during training and output bounding boxes (AABB or OBB).

Methods	Domain	Ground Truth for Training	Output: OBB	Output: AABB
RRPN [15]	Text Images	Oriented Boxes	✓	-
R2CNN++ [20]	Aerial Images	Oriented Boxes	✓	✓
DMP-Net [33]	Text Images	Oriented Boxes	✓	-
FOTS [34]	Text Images	Oriented Boxes	✓	-
ICN+FPN [16]	Aerial Images	Oriented Boxes	✓	✓
RBox-CNN [35]	Aerial Images	Oriented Boxes	✓	-
RoI-Trans [5]	Aerial Images	Oriented Boxes	✓	-
OAOD (Ours)	RGB Images	Axis-Aligned + Angle	✓	✓

Table 2: OAOD-AA (no orientation offsets) & OAOD-AA+ (with orientation offsets) vs state-of-the-art AABB object detectors at multiple IoU levels. $AP_{g}$ & $AP_{r}$: Average Precision of gun & rifle respectively. Highest values shown in Red, $2^{nd}$ highest shown in Blue.

Methods	$AP_{40}$			$AP_{50}$			$AP_{60}$
Methods	$AP_{g}$	$AP_{r}$	$mAP$	$AP_{g}$	$AP_{r}$	$mAP$	$AP_{g}$	$AP_{r}$	$mAP$
YOLOv2	70.7	83.3	77.0	62.3	77.0	69.6	41.9	62.9	52.4
YOLOv3	80.8	78.6	79.8	76.0	70.7	73.4	64.3	59.0	61.7
SSD	70.6	79.0	74.8	65.6	73.0	69.3	55.2	58.2	56.7
DSSD	77.4	78.9	78.1	73.0	72.3	72.7	63.2	58.9	61.1
FRCNN	88.7	89.0	88.9	80.2	79.4	79.8	67.8	68.3	68.1
OAOD-AA	88.8	89.6	89.2	84.4	86.4	85.4	67.0	74.0	70.3
OAOD-AA+	89.6	90.2	89.9	87.6	88.9	88.3	73.2	78.1	75.7

Table 3: Comparison of the proposed OAOD-OB (no orientation offsets) & OAOD-OB+ (with orientation offsets) with state-of-the-art OBB detectors at different IoU levels. $OBB_{rot}$ are rotated versions of AABB whereas $OBB_{ann}$ are manually annotated oriented boxes. Highest values shown in Red, $2^{nd}$ highest shown in Blue.

Methods	Baseline	$OBB_{rot}$		$OBB_{ann}$
Methods	Baseline	$AP_{50}$	$AP_{60}$	$AP_{50}$	$AP_{60}$
R2CNN++	ResNet-50	54.5	44.8	43.0	27.9
DOTA-FRCNN	ResNet-101	58.7	46.9	57.1	46.3
RoI-Trans	ResNet-101	77.5	45.9	68.5	48.5
OAOD-OB	VGG16	77.9	62.2	69.7	50.2
OAOD-OB+	VGG16	82.3	63.8	77.5	49.6

Table 4: Orientation accuracy and mean average precision (stage 1) with varying values of $\beta$ in (7) over the validation dataset. Red marks the highest values, which are used in our experiments.

$\beta$	1	0.5	0.325	0.25	0.125	0.1	0.0625
$mAP_{validation}$	51.5	62.9	66.6	72.5	71.9	74.8	72.5
Accuracy	84.4	83.5	84.3	84.2	83.9	84.7	82.9

Table 5: Orientation absolute error and mean average precision (stage 1) with varying values of $\eta$ in (7) over the validation dataset. Red marks the highest mAP, while Blue marks the values used in our experiments, for which the absolute error is lower with comparable mAP.

$\eta$	1.5	1.25	1.0	0.75	0.50	0.25
$mAP_{validation}$	77.9	71.1	78.2	77.7	78.3	77.6
Absolute Error	4.9	4.8	4.4	4.8	4.6	4.9

Table 6: Orientation accuracy, orientation absolute error and mean average precision (stage 1) with varying number of orientation classes in (7) over the validation set. Red marks the highest mAP, while Blue marks the values used in our experiments, which give a smaller absolute error with comparable mAP.

$\theta_{n}$	4	8	12
$mAP_{validation}$	67.5	74.8	76.6
Accuracy	91.1	84.7	69.7
Absolute Error	10.6	3.8	4.6

Table 7: Comparison of OAOD with baseline (FRCNN), Stage-1 and 2-Loss Net for AABB task. Highest values shown in Red, $2^{nd}$ highest shown in Blue.

IoU

FRCNN

Stage-1 Net

2-Loss Net

OAOD-AA

OAOD-AA+

0.4

88.9

88.8

89.2

89.9

0.5

79.8

82.3

82.9

85.4

88.3

0.6

68.1

66.1

65.9

70.3

75.7

Equations

$$L^{f}_{1}(p^{f}_{1},u^{f}_{1},n_{b})=\sum_{i=1}^{n_{b}}\sum_{j=1}^{n_{f}}u^{f}_{1}(i,j)\log(p^{f}_{1}(i,j)), \tag{1}$$

$$L^{o}_{1}(p^{o}_{1},u^{o}_{1},n_{b})=\sum_{i=1}^{n_{b}}\sum_{j=0}^{n_{o}}\delta_{i}\,u^{o}_{1}(i,j)\log(p^{o}_{1}(i,j)), \tag{2}$$

$$L^{b}_{1}(p^{b}_{1},u^{b}_{1},n_{b})=\sum_{i=1}^{n_{b}}\sum_{j=1}^{4}\delta_{i}\,S_{\ell_{1}}(p^{b}_{1}(i,j)-u^{b}_{1}(i,j)) \tag{3}$$

$$S_{\ell_{1}}(x)=\begin{cases}0.5x^{2} & \text{if } |x|<1\\ |x|-0.5 & \text{otherwise}\end{cases} \tag{4}$$

$$u^{r}_{1}=\frac{u^{o}_{1}-r_{gt}}{r_{m}}, \tag{5}$$

$$L^{r}_{1}(p^{r}_{1},u^{r}_{1},n_{b})=\sum_{i=1}^{n_{b}}\sum_{j=1}^{n_{o}}\delta_{i}\,S_{\ell_{1}}(p^{r}_{1}(i,j)-u^{r}_{1}(i,j)), \tag{6}$$

$$L_{1}=\alpha L^{f}_{1}(p^{f}_{1},u^{f}_{1},n_{b})+\beta L^{o}_{1}(p^{o}_{1},u^{o}_{1},n_{b})+\gamma L^{b}_{1}(p^{b}_{1},u^{b}_{1},n_{b})+\eta L^{r}_{1}(p^{r}_{1},u^{r}_{1},n_{b}) \tag{7}$$

$$L^{f}_{2}(p^{f}_{2},u^{f}_{2},n_{b})=\sum_{i=1}^{n_{b}}\sum_{j=1}^{n_{f}}u^{f}_{2}(i,j)\log(p^{f}_{2}(i,j)) \tag{8}$$

$$L^{b}_{2}(p^{b}_{2},u^{b}_{2},n_{b})=\sum_{i=1}^{n_{b}}\sum_{j=1}^{4}\Theta_{i}\,S_{\ell_{1}}(p^{b}_{2}(i,j)-u^{b}_{2}(i,j)) \tag{9}$$

$$L_{2}=L^{f}_{2}(p^{f}_{2},u^{f}_{2},n_{b})+L^{b}_{2}(p^{b}_{2},u^{b}_{2},n_{b}) \tag{10}$$

Code & Models

Repositories

makhtar17004/orientation-aware-firearm-detection (caffe2, official)

Taxonomy

Topics: Advanced Neural Network Applications · Adversarial Robustness in Machine Learning · Anomaly Detection Techniques and Applications

Full text

Leveraging Orientation for Weakly Supervised Object Detection with Application to Firearm Localization

Javed Iqbal

Muhammad Akhtar Munir

Arif Mahmood

Afsheen Rafaqat Ali

Mohsen Ali

Information Technology University, Lahore, 54000, Pakistan

Abstract

Automatic detection of firearms is important for enhancing the security and safety of people; however, it is a challenging task owing to the wide variations in shape, size and appearance of firearms. Also, most generic object detectors process axis-aligned rectangular areas, though a thin and long rifle may actually cover only a small percentage of that area and the rest may contain irrelevant details that suppress the required object signatures. To handle these challenges, we propose a weakly supervised Orientation Aware Object Detection (OAOD) algorithm which learns to detect oriented object bounding boxes (OBB) while using Axis-Aligned Bounding Boxes (AABB) for training. The proposed OAOD differs from existing oriented object detectors, which strictly require OBB during training, and these may not always be available. The goal of training on AABB and detecting OBB is achieved by employing a multistage scheme, with Stage-1 predicting the AABB and Stage-2 predicting the OBB. Between the two stages, an oriented proposal generation module along with object aligned RoI pooling is designed to extract features based on the predicted orientation and to make these features orientation invariant. A diverse and challenging dataset consisting of eleven thousand images is also proposed for firearm detection, manually annotated for firearm classification and localization. The proposed ITU Firearm dataset (ITUF) contains a wide range of guns and rifles. The OAOD algorithm is evaluated on the ITUF dataset and compared with current state-of-the-art object detectors, including fully supervised oriented object detectors. OAOD outperforms both types of object detectors by a significant margin. The experimental results (mAP: 88.3 on AABB & mAP: 77.5 on OBB) demonstrate the effectiveness of the proposed algorithm for firearm detection.

Keywords: Oriented Object Detection, Firearms Detection, Gun Violence, Surveillance and Security, Weakly-Supervised Object Detection, Deep Convolutional Neural Networks

Journal: Neurocomputing

1 Introduction

In recent years, the world has witnessed an exponential increase in gun violence, morphing from isolated street crimes to incidents of mass shootings [1, 2, 3]. Governments and private security agencies have been expanding the use of surveillance systems to monitor and secure public and private spaces. These systems mostly rely on large installations of cameras that are passive, with monitoring delegated to human operators. Usually, video from multiple CCTV cameras is streamed to a central station, where trained operators monitor the live footage, proactively watching for unusual activities and prohibited objects such as weapons. The operator’s ability to detect abnormalities while monitoring a video feed is influenced by many variables, including both technical factors (quality of images) and human factors such as age, experience, training and shift duration [4].

Studies have shown that the human ability to detect abnormalities in live feeds decreases as the number of simultaneous feeds increases [7, 8]. Firearm-based incidents are more difficult to detect since they mostly do not involve a physical altercation; the mere presence of a firearm changes the dynamics of the situation. A visual firearm detection system would be helpful not only in active security monitoring but also in monitoring harmful content on social media. Such a solution can be embedded in surveillance systems to significantly improve the identification of potential gun violence incidents.

Despite the pressing need for a firearm detection system, no significant research work has yet been done in this direction owing to a number of challenges. Visual firearm detection is inherently challenging due to intentional or unintentional occlusions, the close proximity of the object to the human body, and designs intended for camouflage. Existing visual object detectors [6, 9, 10], despite being successful in detecting a wide variety of common objects, do not perform well on firearms (Fig. 1). One reason is that most existing object detectors predict object locations by looking at features in axis-aligned bounding boxes [11, 12, 13, 14, 6, 9, 10].

The physically thin and elongated structure of rifles and the small size of most guns make these axis-aligned detectors inefficient due to a low signal-to-noise ratio, where the signal is the firearm signature and the noise is everything else in the bounding box. This problem is evident when a firearm is carried by a person: the axis-aligned bounding box tends to contain substantial information belonging to the background or to non-firearm objects, such as the person carrying it (Fig. 2). The inherent size and shape variations of long and thin firearms, unfavorable viewing angles, and clutter make detection more challenging than for other objects such as human faces and vehicles.

Recently, some oriented object detection methods have also been proposed that detect Oriented Bounding Boxes (OBB) aligned with the objects. These algorithms target applications such as ships and aeroplanes in satellite images, and text in documents, by predicting oriented region proposals [15, 16]. However, this requires oriented boxes among the anchors at each location of the feature map, which is computationally inefficient because the number of anchors increases significantly. Also, training the region proposal network (RPN) and the detector itself requires oriented boxes as ground truth. Annotating such boxes is time-consuming and error-prone, which is why most datasets provide only axis-aligned boxes.

To address these challenges, we propose an Orientation Aware multi-stage object detection system (OAOD), which is trained in a weakly supervised fashion, only on the axis-aligned bounding boxes (AABB) and the orientation of the firearms. In our proposed system, the RPN and orientation prediction are kept separate, allowing us to use a smaller number of anchors than oriented object detectors. In the first stage of our proposed system, the AABB and object orientation are jointly estimated, and in the second stage, a novel Oriented Proposal Generation (OPG) module is introduced to generate Oriented Region Proposals ($ORP$) by incorporating the predicted orientation information. The OPG is followed by Object Aligned Region of Interest pooling (OARoI-Pooling) to pool the features without background noise. Our proposed system predicts both axis-aligned and object-aligned bounding boxes, while being trained only on the axis-aligned bounding boxes and orientation information in a weakly supervised fashion. The main contributions of the current work include:

• We propose a weakly supervised deep learning architecture to predict Oriented Bounding Boxes (OBB) without using OBB annotations during training.

• Orientation classification and regression modules are proposed to predict orientation from the axis-aligned region proposals.

• An Oriented Proposal Generation (OPG) module is proposed to generate Oriented Region Proposals ($ORP$), followed by Object Aligned RoI-pooling (OARoI-Pooling) to pool target object features while discarding the background noise. Such a setup yields features that are independent of the object’s orientation, simplifying the task of classification and bounding box regression and thus improving the accuracy of the classifier and bounding box regressor in the last stage.

• An extensive firearm dataset, ITU-Firearm (ITUF), is also proposed, consisting of around 13647 annotated firearm instances in 10973 images.

• Our method achieves state-of-the-art performance compared to existing methods on the proposed ITUF dataset.

For a comprehensive analysis, the proposed OAOD algorithm is compared with five existing state-of-the-art axis-aligned object detection methods [6, 17, 18, 10, 19] and three oriented object detection methods [5, 20, 21]. The proposed OAOD outperforms these existing methods in terms of accuracy and stability.

2 Related Work

Generic Object Detectors: Significant progress has been made in developing deep-CNN-based axis-aligned generic object detectors, which can be divided into two categories: multi-stage and single-stage detectors. Multi-stage detectors generally contain a first-stage RPN that selects one of the predefined anchor boxes as the proposal at each location [6, 22, 23, 24, 25]. The next stage regresses the final bounding boxes and classifies them into the object classes. Single-stage object detectors, like the YOLO family [9, 18, 17] and SSD [10], are well known for high detection speed but have been found to have lower performance [26]. Attributing this gap to class imbalance, Lin et al. [27] proposed the focal loss; however, such detectors still suffer from performance degradation when detecting small objects. The research community is working to make object detection proposal-free or anchor-free [28, 29, 30], though two-stage detectors [25, 31, 32] still have better accuracy due to better region sampling. Axis-aligned object detectors do not address the challenge of thin and elongated objects, where orientation makes the object size disproportionate to the AABB size.

Weakly Supervised Object Detectors: Weakly supervised object detection has been broadly studied in the last few years. Early approaches [36] exploited representations learned by deep convolutional neural networks pre-trained on the image classification task. In general, features were pooled from the regions indicated by a region proposal generation process, and image classification and object detection were jointly trained on these by back-propagating the image classification loss. In contrast, [37] proposed a Multiple Instance Learning (MIL) approach to iteratively refine the object detector by back-propagating image-level labels through multiple object detection streams. Building on [37], proposal cluster learning uses image-level annotations for object detection [38]; it is an iterative process that assigns labels on the basis of proposal clusters to refine the instance classifier. TS2C [39] exploits the weakly supervised object segmentation task to help MIL-based weakly supervised object detectors concentrate on the whole object rather than just the discriminative parts. We present a novel method that uses weakly supervised orientation information and axis-aligned bounding boxes for object detection, with application to firearms.

Small Object Detectors: For some small objects, such as human faces, contextual information may be significant for learning a deep model [40]. Similarly, features from the RPN have been used for small-sized pedestrian detection [11]. Singh et al. proposed scale normalized training to address the problem of extreme scale variations [25]. Liu et al. [41] also emphasized the significance of context and instance relationships for accurate object detection. However, in the case of firearms, most of the contextual objects may be irrelevant and behave as noise, suppressing the required object information.

Oriented Object Detectors: Most of the recent oriented object detectors [15, 20, 42, 34, 16, 5, 35] are in the domains of document processing or remote sensing, where objects are detected in aerial imagery (Table 1), and use OBB to train in a fully supervised way. OBB-predicting methods try to handle challenges like dense objects, arbitrary orientation, and background noise. Ma et al. [15] used rotated anchors at the RPN level for text detection. Yang et al. [20] used attention to improve the detection of dense objects in arbitrary orientations. Ding et al. [5] proposed a rotated RoI transformer, trained in a fully supervised way, to reduce the number of anchors at the RPN level. Nevertheless, as indicated in Table 1, existing oriented object detection methods use OBB information as ground truth during training. In contrast, we propose a cascaded approach to detect oriented objects in a weakly supervised way, using orientation and axis-aligned bounding boxes during training.

Firearms Detectors: Research on visual firearm detection in images or videos is quite sparse, and currently there is no dedicated firearm detector or benchmark dataset for evaluation and comparison. Olmos et al. used FRCNN for handgun detection only [43], and no results are reported on rifles. Akcay et al. used FRCNN, RFCN, YOLOv2 and RCNN for gun detection in x-ray baggage security imagery [44]. In contrast to these existing approaches, in the current work we propose a generic firearm detection and classification framework. The proposed framework is more comprehensive and does not require OBB ground truth information during training. To the best of our knowledge, the proposed firearm detection framework is novel.

3 Proposed Orientation Aware Object Detector

Most current object detectors predict axis-aligned bounding boxes (AABB), and for that they analyze features pooled from an axis-aligned window. Uniform pooling from an axis-aligned window may yield features contaminated by noise from uncorrelated background objects in the window, as shown in Fig. 2, adversely affecting object detection performance. To overcome this issue, we propose an Orientation Aware Object Detector (OAOD), consisting of a cascade of two stages (Fig. 3). The proposed network takes an entire image as input, localizes the firearms and simultaneously classifies them into rifles and guns. For localization, it predicts both Oriented Bounding Boxes (OBB) and axis-aligned bounding boxes for firearms. Unlike other oriented object detectors [15, 20, 34, 16, 5, 33], OAOD does not use OBB ground truth for training. It instead relies only on the orientation information, which is much easier to annotate than OBB, and learns to predict the OBB in a weakly supervised way. In the following, stage-1 and stage-2 are explained in more detail.

3.1 OAOD Stage-1

The stage-1 of OAOD consists of a Region Proposal Network (RPN) followed by a firearm localization, classification and orientation estimation network.

3.1.1 Region Proposal Network (RPN)

The RPN is retrained on the firearms training dataset similar to [6]. The RPN is applied to the deep features computed by VGG16 [45] backbone model to generate initial axis-aligned region proposals, $RP_{1}$ which are then input to the next step.

3.1.2 Object and Orientation Classification Network

During training, each region proposal $\in RP_{1}$ is associated with a unique ground-truth bounding box, based on maximum IoU between that proposal and the ground truth if maximum IoU is $\geq$ 0.50. On the basis of this association, class label, orientation label, orientation offset, and bounding box offsets are assigned to that proposal. If maximum IoU is $<0.5$ but $\geq 0.1$ , that particular proposal is labeled as background, while the others are rejected. Thus the region proposals may have classification labels as background, gun or rifle.
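The labelling rule above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names, the `[x1, y1, x2, y2]` box layout, and the integer class encoding (0 = background, 1 = gun, 2 = rifle, -1 = rejected) are assumptions made for the example; only the two IoU thresholds come from the text.

```python
import numpy as np

def iou_matrix(proposals, gt_boxes):
    """Pairwise IoU between proposals (P, 4) and ground-truth boxes (G, 4).

    Boxes are [x1, y1, x2, y2]; this layout is an assumption of the sketch.
    """
    p = proposals[:, None, :]
    g = gt_boxes[None, :, :]
    iw = np.clip(np.minimum(p[..., 2], g[..., 2]) - np.maximum(p[..., 0], g[..., 0]), 0, None)
    ih = np.clip(np.minimum(p[..., 3], g[..., 3]) - np.maximum(p[..., 1], g[..., 1]), 0, None)
    inter = iw * ih
    area_p = (p[..., 2] - p[..., 0]) * (p[..., 3] - p[..., 1])
    area_g = (g[..., 2] - g[..., 0]) * (g[..., 3] - g[..., 1])
    return inter / (area_p + area_g - inter)

def assign_labels(proposals, gt_boxes, gt_classes, fg_thresh=0.5, bg_thresh=0.1):
    """Label each proposal from its best-matching ground-truth box.

    Returns (labels, matched_gt). labels: firearm class if max IoU >= 0.5,
    0 (background) if 0.1 <= max IoU < 0.5, and -1 (rejected) otherwise.
    """
    ious = iou_matrix(proposals, gt_boxes)
    matched_gt = ious.argmax(axis=1)   # index of the best ground-truth box
    max_iou = ious.max(axis=1)
    labels = np.full(len(proposals), -1, dtype=int)
    labels[(max_iou >= bg_thresh) & (max_iou < fg_thresh)] = 0
    fg = max_iou >= fg_thresh
    labels[fg] = gt_classes[matched_gt[fg]]
    return labels, matched_gt
```

The orientation label, orientation offset, and bounding box offsets of a foreground proposal would then be taken from the ground-truth box indexed by `matched_gt`.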

RoI pooling similar to FRCNN is used to pool the features to a fixed size for further processing. These features are input to a network consisting of two fully connected layers with four separate output heads, one for each of the four tasks: object classification, orientation classification, bounding box and orientation offsets regression. The cross entropy loss function for firearm classification as gun, rifle, and background in stage-1 ( $L^{f}_{1}$ ) is defined as:

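$$L^{f}_{1}(p^{f}_{1},u^{f}_{1},n_{b})=\sum_{i=1}^{n_{b}}\sum_{j=1}^{n_{f}}u^{f}_{1}(i,j)\log(p^{f}_{1}(i,j)), \tag{1}$$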

where $p^{f}_{1}\in\mathcal{R}^{n_{f}}$ is the predicted firearm class probability and $u^{f}_{1}=\{\{1,0,0\},\{0,1,0\},\{0,0,1\}\}\in\mathcal{R}^{n_{f}}$ is the actual firearm class label, $n_{f}=3$ is the number of object classes including background, gun, and rifle, and $n_{b}$ is the number of object proposals in a mini batch. To predict the orientation of a region proposal effectively, the objects are divided into $n_{o}=8$ orientation classes in the range of $0^{o}$ to $180^{o}$ as shown in Fig. 5. The other half-circle contains objects pointing in the exact opposite direction, which are assigned to the same class as the corresponding orientation in the upper half-circle. For each region proposal, the orientation classification head predicts a label within the specified $n_{o}$ classes, trained with the orientation loss function $L^{o}_{1}$:

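$$L^{o}_{1}(p^{o}_{1},u^{o}_{1},n_{b})=\sum_{i=1}^{n_{b}}\sum_{j=0}^{n_{o}}\delta_{i}\,u^{o}_{1}(i,j)\log(p^{o}_{1}(i,j)), \tag{2}$$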

where $p^{o}_{1}\in\mathcal{R}^{n_{o}}$ is the predicted orientation class probability and $u^{o}_{1}\in\mathcal{R}^{n_{o}}$ is the actual orientation class label, $n_{o}=8$ is the number of orientation classes, and $n_{b}$ is the number of object proposals in a mini batch corresponding to the firearms in the ground truth. Similarly, $\delta_{i}$ is an indicator variable for the $i^{th}$ object proposal that excludes the background class from the orientation loss during training: $\delta_{i}=1$ if the label is gun/rifle and $\delta_{i}=0$ otherwise.

3.1.3 Bounding Box Regression

Alongside object classification, accurate localization is also very important in object detection. We train a bounding box regression head to regress offsets. The objective function for the bounding box regression is given by:

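$$L^{b}_{1}(p^{b}_{1},u^{b}_{1},n_{b})=\sum_{i=1}^{n_{b}}\sum_{j=1}^{4}\delta_{i}\,S_{\ell_{1}}(p^{b}_{1}(i,j)-u^{b}_{1}(i,j)) \tag{3}$$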

where $p^{b}_{1}=(p_{x},p_{y},p_{w},p_{h})$ are the predicted bounding box offsets and $u^{b}_{1}=(u_{x},u_{y},u_{w},u_{h})$ are the actual ground truth offsets for the respective proposal. Also, $n_{b}$ and $\delta_{i}$ are the same as defined above. $S_{\ell_{1}}(\cdot)$ is the smooth $\ell_{1}$ function:

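$$S_{\ell_{1}}(x)=\begin{cases}0.5x^{2} & \text{if } |x|<1\\ |x|-0.5 & \text{otherwise}\end{cases} \tag{4}$$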

During training, $L_{1}^{b}$ is back-propagated for only those object proposals which correspond to firearms in the ground truth while the others corresponding to the background are ignored by using the indicator variable $\delta_{i}$ .
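As a concrete illustration of the smooth $\ell_{1}$ function and the role of the indicator $\delta_{i}$, the following NumPy sketch evaluates the stage-1 bounding box loss for a small batch. The function names and the array layout (one row of four offsets per proposal) are assumptions of the example.

```python
import numpy as np

def smooth_l1(x):
    """Smooth l1: 0.5 * x^2 where |x| < 1, and |x| - 0.5 otherwise."""
    x = np.asarray(x, dtype=float)
    absx = np.abs(x)
    return np.where(absx < 1.0, 0.5 * x ** 2, absx - 0.5)

def bbox_regression_loss(pred, target, is_firearm):
    """Sum of smooth l1 over proposals and the 4 box offsets, masked by delta_i.

    pred, target: arrays of shape (n_b, 4); is_firearm: (n_b,) with 1 for
    gun/rifle proposals and 0 for background proposals.
    """
    delta = np.asarray(is_firearm, dtype=float)[:, None]
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.sum(delta * smooth_l1(diff)))
```

For example, a background proposal with `is_firearm = 0` contributes nothing to the loss regardless of how far its predicted offsets are from the targets.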

3.1.4 Orientation Offsets Regression

In addition to treating orientation as a classification task, we also rectify the predicted class mean angle (center of the bin as described in Sec. 4) based on the continuous-valued orientation ground truth ($r_{gt}$). In our work, one orientation class represents the degree range $u^{o}_{1}-r_{m}$ to $u^{o}_{1}+r_{m}$, as described in Sec. 4, where $r_{m}$ is equal to half of the bin size. In our experiments $r_{m}=11.25^{o}$, since we have set the number of classes to 8 (a bin size of $180^{o}/8=22.5^{o}$). The orientation offset is measured as the deviation of $u^{o}_{1}$ from the ground truth $r_{gt}$. The offset is then normalized by $r_{m}$ to the range $[-1,1]$ as follows

$$u^{r}_{1}=\frac{u^{o}_{1}-r_{gt}}{r_{m}}, \tag{5}$$

where $r_{gt}$ is subtracted from the mean angle $u^{o}_{1}$ associated with the classification task and the result is normalized by $r_{m}$ to obtain the ground truth offsets used to rectify the predicted class mean angle $p_{1}^{o}$. At inference, the predicted offsets are scaled back to the original values and then added to the mean angle obtained by classification. This helps to localize the oriented area for feature pooling better, removing background noise and clutter more effectively than the classification-only procedure of Sec. 3.1.2. The loss function for orientation offsets regression is as follows

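$$L^{r}_{1}(p^{r}_{1},u^{r}_{1},n_{b})=\sum_{i=1}^{n_{b}}\sum_{j=1}^{n_{o}}\delta_{i}\,S_{\ell_{1}}(p^{r}_{1}(i,j)-u^{r}_{1}(i,j)), \tag{6}$$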

where $p^{r}_{1}$ represents the regressed orientation offsets and $u^{r}_{1}$ the orientation ground truth offsets; $n_{b}$ and $\delta_{i}$ are the same as defined above. $S_{\ell_{1}}(\cdot)$ is the smooth $\ell_{1}$ function as defined in (4).
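The orientation targets of this subsection can be sketched as follows. Two points are assumptions of the sketch and not stated in the text above: the eight bins are taken to be centred at multiples of $22.5^{o}$ (so class 0 covers $-11.25^{o}$ to $11.25^{o}$ modulo $180^{o}$), and the function names are hypothetical. The encoding follows $u^{r}_{1}=(u^{o}_{1}-r_{gt})/r_{m}$ as written, so the exact inverse subtracts the scaled offset from the class mean angle; the text describes the inference step as an addition, which would correspond to the opposite sign convention for the offset.

```python
import numpy as np

N_O = 8              # number of orientation classes
BIN = 180.0 / N_O    # bin size in degrees (22.5)
R_M = BIN / 2.0      # half bin size r_m (11.25)

def encode_orientation(r_gt):
    """Map an angle in degrees to (class index, class mean angle, normalised offset).

    Angles are folded into [0, 180), so opposite directions share a class.
    Bin centres at k * 22.5 degrees are an assumption of this sketch.
    """
    r = r_gt % 180.0
    k = int(np.floor((r + R_M) / BIN)) % N_O
    centre = k * BIN
    # Signed difference (centre - r), wrapped so it stays within one half bin.
    diff = (centre - r + 90.0) % 180.0 - 90.0
    return k, centre, diff / R_M

def decode_orientation(centre, offset):
    """Algebraic inverse of the encoding: r = u_o - r_m * u_r, folded into [0, 180)."""
    return (centre - R_M * offset) % 180.0
```

With these definitions, a rifle at $30^{o}$ falls in class 1 (mean angle $22.5^{o}$) with a normalised offset of $-7.5/11.25$, and a rifle at $210^{o}$ receives the same targets because it points in the exact opposite direction.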

The overall training objective function for OAOD stage-1 is a weighted combination of the individual losses of object and orientation classification along with bounding box and orientation offsets regression

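$$L_{1}=\alpha L^{f}_{1}(p^{f}_{1},u^{f}_{1},n_{b})+\beta L^{o}_{1}(p^{o}_{1},u^{o}_{1},n_{b})+\gamma L^{b}_{1}(p^{b}_{1},u^{b}_{1},n_{b})+\eta L^{r}_{1}(p^{r}_{1},u^{r}_{1},n_{b}) \tag{7}$$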

where $\alpha$, $\beta$, $\gamma$, and $\eta$ are weights that assign relative importance to each term in the objective function. The bounding box offset regression targets $u^{b}_{1}$ are also normalized to the same range of $[-1,+1]$.
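A minimal NumPy sketch of the weighted stage-1 objective is given below. It is illustrative only: the function names are hypothetical, the default weights of 1.0 are placeholders rather than the values used in the experiments, and the cross-entropy terms use the usual negative log-likelihood sign, which is an assumption of the sketch.

```python
import numpy as np

def smooth_l1(x):
    """Smooth l1: 0.5 * x^2 where |x| < 1, and |x| - 0.5 otherwise."""
    absx = np.abs(x)
    return np.where(absx < 1.0, 0.5 * x ** 2, absx - 0.5)

def cross_entropy(prob, onehot, mask=None):
    """Sum over proposals of -sum_j u(i, j) * log p(i, j), optionally masked."""
    prob = np.asarray(prob, dtype=float)
    onehot = np.asarray(onehot, dtype=float)
    per_proposal = -np.sum(onehot * np.log(prob + 1e-12), axis=1)
    if mask is not None:
        per_proposal = per_proposal * mask
    return float(per_proposal.sum())

def stage1_loss(p_f, u_f, p_o, u_o, p_b, u_b, p_r, u_r, delta,
                alpha=1.0, beta=1.0, gamma=1.0, eta=1.0):
    """Weighted sum of the four stage-1 terms; delta masks background proposals."""
    delta = np.asarray(delta, dtype=float)
    l_f = cross_entropy(p_f, u_f)                 # firearm classification
    l_o = cross_entropy(p_o, u_o, mask=delta)     # orientation classification
    l_b = float(np.sum(delta[:, None] * smooth_l1(np.asarray(p_b) - np.asarray(u_b))))
    l_r = float(np.sum(delta[:, None] * smooth_l1(np.asarray(p_r) - np.asarray(u_r))))
    return alpha * l_f + beta * l_o + gamma * l_b + eta * l_r
```

Because of the mask, only the classification term receives a contribution from background proposals; the orientation and regression terms are driven solely by proposals matched to a gun or rifle.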

3.2 OAOD Stage-2

The output of stage-1 consists of offsets for the Axis-Aligned Bounding Boxes (AABB), their orientation class, orientation offset, and the firearm classification result. To remove the noisy features belonging to the background, we use the output of stage-1 to generate Oriented Region Proposals ($ORP$) with the Oriented Proposal Generation (OPG) module, and perform OARoI-Pooling on these proposals before presenting them to the stage-2 classifier and regressor, which generate the Oriented Bounding Boxes (OBB). These steps are discussed in more detail in the following sections.

3.2.1 Updating Region Proposals for Stage-2

The bounding box offsets $p^{b}_{1}$ output by stage-1 are used to translate and scale the region proposals $RP_{1}$ to get $RP_{2}$, $RP_{2}=RP_{1}+p^{b}_{1}$. As a result, the IoU of $RP_{2}$ with the corresponding ground truth bounding box $u^{b}_{1}$ may change, requiring a revision of the labels and training offsets for this stage. New ground-truth labels ($u^{f}_{2},u^{b}_{2}$) for each updated region proposal are recomputed by considering its IoU with the ground-truth bounding boxes.
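The update from $RP_{1}$ to $RP_{2}$ can be illustrated with a short sketch. The text only states that the offsets translate and scale the proposals; the parameterisation below (centre shifts proportional to box size, log-scale width and height changes, as commonly used in Faster R-CNN-style detectors) is an assumption, as are the function name and the `[x1, y1, x2, y2]` box layout.

```python
import numpy as np

def apply_offsets(boxes, offsets):
    """Translate and scale boxes [x1, y1, x2, y2] by offsets (dx, dy, dw, dh).

    Assumed parameterisation: the centre moves by dx * w and dy * h, and the
    width and height are multiplied by exp(dw) and exp(dh).
    """
    boxes = np.asarray(boxes, dtype=float)
    offsets = np.asarray(offsets, dtype=float)
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    cx = boxes[:, 0] + 0.5 * w
    cy = boxes[:, 1] + 0.5 * h
    new_cx = cx + offsets[:, 0] * w
    new_cy = cy + offsets[:, 1] * h
    new_w = w * np.exp(offsets[:, 2])
    new_h = h * np.exp(offsets[:, 3])
    return np.stack([new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
                     new_cx + 0.5 * new_w, new_cy + 0.5 * new_h], axis=1)
```

After this step, the IoU of each updated box with the ground truth would be recomputed and the stage-2 labels reassigned, as described above.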

3.2.2 Oriented Proposal Generation

For the $i^{th}$ region proposal, adding the predicted offset $p_{1}^{r}(i)$ to the predicted mean angle $p_{1}^{o}(i)$ of the orientation class, we compute the updated orientation $\theta_{i}$, where $\theta_{i}=p^{o}_{1}(i)+r_{m}\times p_{1}^{r}(i)$, and $r_{m}=11.25^{o}$ in our case (Sec. 3.1.4). This updated angle is then used to generate Oriented Region Proposals ($ORP$) aligned with the firearm object. If $RP_{2}$ is directly rotated by $\theta_{i}$, it becomes aligned with the firearm but is not compact and may encapsulate even more background information than the original $RP_{2}$. To address this issue, a maximum-area oriented rectangle is computed inside $RP_{2}$ such that the longitudinal axis of this rectangle is aligned with $\theta_{i}$, as shown in Fig. 4. This oriented rectangle removes the extra background information; however, in many cases it does not cover the full length of the object. Therefore, to obtain the $ORP$, the maximum-area rectangle is extended along the longitudinal axis up to the corners of $RP_{2}$. As shown in Fig. 4(c), the $ORP$ is aligned with the axis of the object and is relatively tighter than both $RP_{2}$ and its rotated version (green rotated box in Fig. 4(a)).

3.2.3 Object Aligned RoI-Pooling

Since $ORP$ consists of rectangles that are not axis-aligned, the pooling algorithm is modified. We define an Object Aligned RoI-Pooling (OARoI-Pooling) process to pool values from the ORP. Unlike RoI pooling in stage-1, the OARoI-Pooling pools values from an oriented grid instead of an axis-aligned grid where the oriented grid is generated over the ORP.

The oriented feature map resulting from OARoI-Pooling is used for the final classification of the object and the bounding box regression (see Fig. 3(d)). It should be noted that after OARoI-Pooling, the pooled values become invariant to the orientation of the object in $RP_{1}$, thus making it easier for the classifier to perform prediction.
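The idea of pooling over an oriented grid can be sketched as below. This is a simplified stand-in, not the OARoI-Pooling layer itself: it takes a single-channel feature map, places one bilinear sample at the centre of each oriented grid cell instead of pooling over the cell, and uses hypothetical function names and a hypothetical `(cx, cy, length, width, theta)` description of the oriented rectangle, with angles measured in array coordinates (x to the right, y downwards).

```python
import numpy as np

def bilinear_sample(fmap, x, y):
    """Bilinearly sample a 2-D map of shape (H, W) at float coordinates (x, y)."""
    h, w = fmap.shape
    x = np.clip(x, 0, w - 1)
    y = np.clip(y, 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    ax = x - x0
    ay = y - y0
    top = (1 - ax) * fmap[y0, x0] + ax * fmap[y0, x1]
    bottom = (1 - ax) * fmap[y1, x0] + ax * fmap[y1, x1]
    return (1 - ay) * top + ay * bottom

def oriented_roi_sample(fmap, cx, cy, length, width, theta_deg, out_h=7, out_w=7):
    """Sample an out_h x out_w grid aligned with an oriented rectangle.

    u runs along the longitudinal axis of the rectangle and v across it;
    both are rotated by theta and shifted to the rectangle centre.
    """
    t = np.deg2rad(theta_deg)
    u = (np.arange(out_w) + 0.5) / out_w * length - length / 2.0
    v = (np.arange(out_h) + 0.5) / out_h * width - width / 2.0
    uu, vv = np.meshgrid(u, v)
    x = cx + uu * np.cos(t) - vv * np.sin(t)
    y = cy + uu * np.sin(t) + vv * np.cos(t)
    return bilinear_sample(fmap, x, y)
```

Because the sampling grid rotates with the rectangle, the first output axis always runs across the object and the second along it, which is the sense in which the pooled features no longer depend on the object's orientation in the image.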

3.2.4 Oriented Object Detection

An oriented object detection sub-network is trained to predict the classification score and bounding box offsets over OARoI-pooled features. These offsets are then applied to the $ORP$ before further processing. The design of the oriented object detection sub-network layers is similar to the stage-1 object classification and bounding box regression layers. The objective function for this sub-network consists of two losses: the oriented object classification loss and the bounding box regression loss. The cross-entropy loss $L^{f}_{2}$ is given by:

[TABLE]

where $p^{f}_{2}\in\mathcal{R}^{n_{f}}$ is the predicted firearm class probability and $u^{f}_{2}\in\mathcal{R}^{n_{f}}$ is the updated firearm class label, while the rest of parameters are similar to (1). The bounding box regression loss $L^{b}_{2}$ is as follows:

[TABLE]

where $\Theta_{i}$ is an indicator variable such that $\Theta_{i}=1$ if the orientation is $0^{\circ}$ or $90^{\circ}$ and $\Theta_{i}=0$ otherwise. Hence, the loss is backpropagated only if the firearm is vertically or horizontally axis-aligned. For these two angles, the $RP_{2}$ and the $ORP$ remain the same; hence ground truth boxes could be used to train the stage-2 bounding box regression. The combined objective function for this oriented object detection sub-network is given below:

[TABLE]
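The role of the indicator $\Theta_{i}$ can be illustrated with a minimal sketch of a masked regression loss. It assumes a smooth-L1 penalty as in Faster R-CNN, averaging over the mini-batch, and an indicator evaluated on the mean angle of the predicted orientation class; these choices and the function names are assumptions, and any foreground mask is omitted.

```python
import numpy as np


def smooth_l1(x):
    """Element-wise smooth-L1 penalty (assumed form)."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * ax ** 2, ax - 0.5)


def stage2_bbox_loss(pred_offsets, target_offsets, mean_angle_deg, tol=1e-6):
    """Regression loss that only counts proposals oriented at 0 or 90 deg."""
    pred = np.asarray(pred_offsets, dtype=float)
    target = np.asarray(target_offsets, dtype=float)
    theta = np.asarray(mean_angle_deg, dtype=float) % 180.0
    # Indicator Theta_i: 1 for axis-aligned orientations, 0 otherwise.
    indicator = (np.abs(theta) < tol) | (np.abs(theta - 90.0) < tol)
    per_box = smooth_l1(pred - target).sum(axis=1)
    return float((indicator * per_box).sum() / max(len(per_box), 1))


pred = [[0.5, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0]]
target = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
loss_mixed = stage2_bbox_loss(pred, target, mean_angle_deg=[0.0, 45.0])
loss_aligned = stage2_bbox_loss(pred, target, mean_angle_deg=[0.0, 90.0])
```

In `loss_mixed` the second proposal is oriented at 45° and contributes nothing, so no gradient would flow from its box offsets; in `loss_aligned` both proposals contribute.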

3.2.5 Oriented Bounding Boxes Output

The bounding box offsets, $p_{2}^{b}$ , predicted in stage-2 are used to update $ORP$ . An inverse transformation is constructed to map this adjusted $ORP$ back to the original image, using the orientation $\theta_{i}$ and the $RP_{2}$ center positions from stage-2. The output of this final step gives us OBB. The step-wise details are provided in Algorithm-1.
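As a rough illustration of mapping an oriented rectangle back to image coordinates, the sketch below computes the four OBB corners from a centre, a length along the object axis, a width across it, and the orientation $\theta_{i}$. The parameterisation, the rotation sign convention and the function name are assumptions made for this example.

```python
import math


def obb_corners(cx, cy, length, width, theta_deg):
    """Corners of a rectangle centred at (cx, cy) whose long axis makes
    angle theta_deg with the image x-axis, listed in order around the box."""
    t = math.radians(theta_deg)
    c, s = math.cos(t), math.sin(t)
    half = [(-length / 2, -width / 2), (length / 2, -width / 2),
            (length / 2, width / 2), (-length / 2, width / 2)]
    # Rotate each local corner by theta and translate to the box centre.
    return [(cx + u * c - v * s, cy + u * s + v * c) for u, v in half]


corners = obb_corners(10.0, 10.0, length=4.0, width=2.0, theta_deg=0.0)
```

For $\theta_{i}=0^{\circ}$ the result reduces to an axis-aligned box, which matches the observation that $RP_{2}$ and $ORP$ coincide for axis-aligned firearms.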

Note that at inference time, we use the bounding box output by stage-1 as the AABB and the box generated by stage-2 as the OBB. In both cases, we use the classification probability from stage-2.

4 ITU Firearms Dataset (ITUF)

We have collected a large dataset of images containing firearms, named ITUF. The axis-aligned bounding box (AABB) of each firearm in each image has been hand-annotated. The dataset has been divided into training and testing splits; for the testing split, OBB were also manually annotated to enable comparison with existing OBB-predicting algorithms. To the best of our knowledge, ITUF is the first large firearm dataset in the public domain. ITUF captures varied scenes (indoor, outdoor, lighting conditions) and scenarios (firearms pointed, carried, lying on tables/ground/racks) and contains various makes and models of firearms (from pistols to AK-47). This diversity makes ITUF a challenging and realistic dataset for the firearm detection task.

**Data Collection and Annotation:** ITUF was collected from the web using search keywords such as weapons, wars, pistol, movie names, firearms, types of firearms, sniper, shooter, corps, guns and rifles. Results were cleaned to remove duplicates, synthetic images and images not containing firearms. The final dataset consists of $10,973$ fully annotated images with $13,647$ firearm instances.

We have divided firearms into two classes: the ‘Gun’ class includes different variations of pistols and revolvers, whereas the ‘Rifle’ class ranges from hunting rifles to AK-47 (including small machine guns). The AABB for each firearm in every image is tagged by an annotator, along with a class label and an angle representing the orientation of the firearm. Orientation is annotated as the angle made by the line joining the muzzle and the back tip (hammer or butt) of the firearm. Orientations are quantized into 8 bins as shown in Fig. 5, and each bin is treated as a class with the value equal to the center of the bin. Each class also includes orientations which are $180^{\circ}$ flipped versions of the angles shown in Fig. 5. For example, class 0 spans $348.75^{\circ}$ to $+11.25^{\circ}$ as well as $168.75^{\circ}$ to $191.25^{\circ}$. The orientation class associated with each firearm represents the center of the associated quantization bin and is referred to as the mean angle.
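The quantization described above can be sketched in a few lines. The sketch assumes that class $k$ is centred at $k \times 22.5^{\circ}$ (the numbering in Fig. 5 may differ) and that an angle exactly on a bin boundary goes to the higher class; both function names are illustrative.

```python
import math


def orientation_class(angle_deg, n_classes=8):
    """Map an annotated angle in degrees to an orientation class index.

    Angles are folded modulo 180 so that a firearm and its 180-degree
    flipped version fall into the same class.
    """
    bin_width = 180.0 / n_classes
    folded = angle_deg % 180.0
    return int(math.floor(folded / bin_width + 0.5)) % n_classes


def mean_angle(cls, n_classes=8):
    """Centre of the quantization bin, i.e. the class's mean angle."""
    return cls * (180.0 / n_classes)


classes = [orientation_class(a) for a in (0.0, 350.0, 180.0, 190.0, 45.0, 225.0)]
```

Here the angles 0°, 350°, 180° and 190° all map to class 0, consistent with the two ranges given for class 0, and 45° and 225° share a class.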

We believe that this dataset will help researchers develop algorithms for firearm detection, not just for security but also for multi-media content analysis, including AR and VR environments. It will also help media and content distribution companies to categorize which content is suitable for age-appropriate consumption. More dataset details may be found in the supplementary material and at the project homepage (http://im.itu.edu.pk/orientation-aware-firearms-detection/).

5 Experiments and Results

The proposed OAOD algorithm is trained and evaluated on the ITUF dataset and is compared with current state-of-the-art axis-aligned and oriented object detection algorithms. A thorough ablation study is performed to validate different parameters and aspects of the proposed OAOD algorithm.

5.1 Experimental Setup

In firearm detection experiments, we localize each firearm in an input image and classify it as a rifle or a gun. For the comparison with the axis-aligned object detection methods, AABB from stage-1 and classification score from stage-2 are used, whereas the output of stage-2 is used for comparisons with OBB detection methods. OAOD is only trained on AABB and orientation ground-truth information, while OBB are not used for training.

Implementation Details: High-resolution images in ITUF are resized to a shorter side of 480 or a longer side of 800 pixels, preserving the aspect ratio. Due to limited GPU memory, a single image per batch is processed. The initial learning rate and momentum are set to 0.001 and 0.90, respectively. The SGD optimizer is used with a weight decay of 0.0005. VGG16 pre-trained on ImageNet [45] is used as the backbone network. Caffe is used as the implementation framework, and training is performed on a single Core i5 machine with 32GB RAM and a GTX 1080 GPU with 8GB memory. The hyper-parameters $\alpha$, $\gamma$ and $\eta$ in (7) are set to 1.0. The parameter $\beta$ is set to 0.1 after validating over a wide range of values (Sec. 5.3).
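One plausible reading of the resizing rule, following the usual Faster R-CNN convention, is sketched below: scale the shorter side to 480 pixels unless that would push the longer side beyond 800 pixels, in which case the longer side is scaled to 800 instead. This interpretation and the function name are assumptions.

```python
def resize_scale(height, width, target_short=480, max_long=800):
    """Aspect-preserving scale factor for an image of the given size."""
    short_side, long_side = min(height, width), max(height, width)
    scale = target_short / short_side
    # Cap the scale so the longer side never exceeds max_long.
    if scale * long_side > max_long:
        scale = max_long / long_side
    return scale


scale_a = resize_scale(960, 1280)    # shorter side governs
scale_b = resize_scale(1000, 3000)   # longer-side cap governs
```

For a 960×1280 image the shorter side sets the scale (0.5), whereas for a very elongated 1000×3000 image the 800-pixel cap takes over.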

**Training Scheme:** We train the two OAOD stages one by one. Initially, stage-1 (Fig. 3) is trained to predict the AABB, classification score and orientation information with the loss function $L_{1}$ (7). After a sufficient number of epochs, we train stage-2 along with the fully connected layers of stage-1. The bounding boxes and the orientation information predicted by stage-1 are passed as input to stage-2 as described in Sec. 3.2. The bounding box regression loss $L^{b}_{2}$ in stage-2 is incorporated only if the orientation for the region proposal is $0^{\circ}$ or $90^{\circ}$, while the classification loss is used for every instance as given by (10). To avoid over-fitting, a dropout of 0.5 is used between the fully connected layers during training.

5.2 Comparison with Existing State-of-the-art Techniques

The trained OAOD and other state-of-the-art algorithms are evaluated on the ITUF test set for both AABB and OBB predictions. To understand the effect of orientation prediction, we present results for both sub-tasks, orientation classification (see Sec. 3.1.2) and orientation regression (see Sec. 3.1.4). OAOD pipelines with orientation classification only are named OAOD-AA and OAOD-OB (for the AABB and OBB predictions, respectively), whereas OAOD-AA+ and OAOD-OB+ denote the pipelines with orientation regression, that is, with orientation offsets added (see Fig. 3). Our model (with orientation regression) produces mAP of 88.3% and 77.5% (at IoU=0.50) for AABB and OBB, respectively. Due to the $ORP$ offset regression and stage-2 classification, the proposed OAOD avoids missed detections and multiple detections while performing more accurate localization. Employing a deeper backbone network such as ResNet-101 [46] may further improve accuracy at the cost of increased space complexity.

5.2.1 Comparison with Axis-Aligned Bounding Box Methods

We compare OAOD against current state-of-the-art one-stage and two-stage AABB object detection algorithms such as SSD, DSSD, YOLOv2, YOLOv3 and FRCNN. These methods were trained on the same ITUF training dataset. All parameters in these algorithms were set as recommended by the original authors.

The proposed OAOD outperforms the compared methods, achieving better mAP than both single-stage and multi-stage detectors, as shown in Table 2. This is attributed to the $ORP$ generated by the OPG module together with OARoI-Pooling, which removes much of the noisy background features and makes stage-2 more accurate. Secondly, OAOD remains stable (Table 2) as IoU levels are varied, despite the fact that the model is trained for IoU=0.50 only. Specifically, compared to the baseline FRCNN, the proposed OAOD-AA and OAOD-AA+ increase mAP by 6.6% and 9.6%, respectively, at IoU=0.50. Compared to the single-stage axis-aligned object detectors, OAOD-AA and OAOD-AA+ improve the performance by a minimum of 14.0% and 16.9%, respectively. The qualitative results are presented in Fig. 6. The proposed OAOD avoids missed detections and produces better localization.

5.2.2 Comparison with Oriented Object Detection Methods

Most existing oriented object detection methods are trained using OBB ground-truth annotations, but such annotations are not available for the ITUF training set.

In order to train the existing oriented bounding box detection methods (R2CNN++ [20], DOTA FRCNN [21], RoI-Trans [5]), we rotate the given AABB ground truth by the respective ground-truth orientation to create OBB. OAOD overcomes the unavailability of OBB ground truth by leveraging the orientation information (stage-1), the ORP with OARoI-Pooling and the inverse transformation (stage-2). Table 3 shows the comparison of OAOD-OB and OAOD-OB+ with the existing state-of-the-art oriented object detection algorithms. OAOD gives more stable results across different IoU values than the other methods. For a comprehensive evaluation, we have tested the proposed OAOD algorithm and the existing OBB detection methods against rotated OBB ($OBB_{rot}$) and annotated OBB ($OBB_{ann}$). The $OBB_{rot}$ are the rotated versions of the AABB, and the $OBB_{ann}$ are the manually tagged boxes (available for the test set only). More specifically, for $OBB_{rot}$ and $OBB_{ann}$ at IoU=0.5, the proposed OAOD outperforms the existing state-of-the-art methods by a minimum of 5.8% and 11.6% mAP, respectively.

Fig. 7 compares the qualitative results of OAOD-OB and OAOD-OB+ with those of the existing state-of-the-art oriented object detection methods. More qualitative results of OAOD are provided in the supplementary material.

To compare the robustness and stability of OAOD with those of existing methods, we evaluate results at different confidence levels. As indicated in Fig. 8 (right), the proposed OAOD-OB and OAOD-OB+ algorithms remain stable, and even at a high confidence threshold, OAOD results deteriorate less than those of the other methods [5, 21, 20]. This could be attributed to the cascaded nature of OAOD and the removal of noisy background features by generating Oriented Proposals and performing Object Aligned RoI-Pooling.

5.3 Ablation Study

A thorough ablation study is performed to evaluate different design choices, including hyper-parameters and OAOD precursor models (Fig. 8 (left)).

Orientation Loss: With the other hyper-parameters in the stage-1 loss (7) fixed to 1 (following the original FRCNN paper), we searched a wide range for the optimal value of $\beta$ by training on 80% of the training dataset and validating on the ITUF validation set, chosen as the remaining 20%. $\beta=0.1$ resulted in increased orientation accuracy as well as mean average precision (Table 4); therefore, $\beta=0.1$ is used for the rest of the experiments.

Orientation Regression Loss: Similar to the orientation loss in (7), we search for the optimal value of the orientation regression loss scaling parameter $\eta$. With the other hyper-parameters in the stage-1 loss (7) fixed to $\alpha=1$ and $\gamma=1$ (following the original values used in FRCNN [6]) and $\beta=0.1$, the optimal value of $\eta$ is searched by training on the training set (chosen as 80% of the training dataset) and validating on the ITUF validation set (chosen as 20% of the training dataset). $\eta=1.0$ and $\eta=0.5$ show comparable mAP (Table 5); however, $\eta=1.0$ has the minimum absolute orientation error. Based on this observation, $\eta=1.0$ is used for the rest of the experiments.

Orientation Classes Distribution: To find the effective number of orientation classes $n_{o}$, we validated different orientation class distributions while keeping the stage-1 hyper-parameters fixed ($\alpha=1$, $\beta=0.1$ and $\gamma=1$). The orientation class distributions, the respective mAP, orientation accuracies and orientation absolute errors (mean of the absolute differences between the predicted and ground-truth orientations) are shown in Table 6. For $n_{o}=4$, the orientation accuracy is higher, but the mAP drops significantly and the orientation absolute error is high. This is because classifying the orientation into fewer classes is an easier task than classifying it into more classes, but the coarser bins come at the cost of decreased mAP and increased absolute orientation error. For more orientation classes, e.g., $n_{o}=12$, the orientation performance decreases because the mean angle values are very close to each other, causing an increase in absolute orientation error, although a slight increase in mAP is reported. We repeated the experiment for $n_{o}=8$, which results in a low absolute error with comparable mAP. Since orientation is the main component of the proposed OAOD algorithm, we choose $n_{o}=8$ orientation classes, which give higher mAP than $n_{o}=4$ and lower orientation absolute error than $n_{o}=12$, as shown in blue in Table 6.
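The orientation absolute error can be illustrated with a short sketch. The text defines it as the mean absolute difference between predicted and ground-truth orientations; the wrap-around at $180^{\circ}$ used below is an added assumption that follows from treating flipped orientations as equivalent, and the function names are illustrative.

```python
def angular_abs_error(pred_deg, gt_deg):
    """Absolute orientation difference in degrees, assuming 180-degree
    periodicity so that, e.g., 170 and 5 degrees are 15 degrees apart."""
    d = abs(pred_deg - gt_deg) % 180.0
    return min(d, 180.0 - d)


def mean_abs_orientation_error(preds, gts):
    """Mean of the per-instance absolute orientation errors."""
    errors = [angular_abs_error(p, g) for p, g in zip(preds, gts)]
    return sum(errors) / len(errors)


# Bin width for each class count considered in the ablation.
bin_widths = {n: 180.0 / n for n in (4, 8, 12)}
mae = mean_abs_orientation_error([170.0, 10.0], [5.0, 30.0])
```

The bin widths (45°, 22.5° and 15° for 4, 8 and 12 classes) show the trade-off discussed above: coarse bins are easy to classify but leave a large residual angle, while fine bins are harder to tell apart.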

Stage-1 Net: Stage-1 (Fig. 3) of the proposed OAOD, described in Sec. 3.1, is also evaluated for axis-aligned firearm detection. Incorporating the orientation information through multi-task learning improves the performance over the baseline FRCNN by 3.1% mAP (Table 7).

2-Loss Net: In this experiment, the proposed cascaded model (Fig. 3) is trained using only the AABB regression loss from stage-1 and the classification loss from stage-2, i.e., $L_{2L}=L^{f}_{2}+L^{b}_{1}$. The other losses are not used in this experiment. The $RP_{1}$ are used for generating the $ORP$ along with OARoI-Pooling, unlike the $RP_{2}$ used in OAOD. Compared to FRCNN, the 2-Loss Net improves the mAP by 3.8% (IoU=0.50), which shows the significance of our basic framework. The 2-Loss Net's mAP remains 6.1% lower than the OAOD mAP, which highlights the importance of the remaining losses.

Failure Cases: Fig. 9 shows two failure cases of the proposed OAOD. The left image characterizes the case where the firearm itself is not detectable (due to viewing angle, pose, color, etc.). The only information is in the pose of the holder; we intend to explore the connection between human pose and firearm localization in future work. In the right image, the failure is only partial: the main component has been localized with the correct orientation, while the barrel has been missed due to occlusion. One possible reason is the lack of such occluded objects in the training data.

6 Conclusion

Rising gun violence and the use of firearms in both electronic media and social media pose a challenge for the security, surveillance and multi-media content curation domains. However, there has been no concrete effort in the direction of the firearm detection problem. We address this by, first, introducing a large and challenging dataset of images containing firearms, named the ITUF dataset, which consists of $10,973$ images in which all firearm instances have been hand-annotated. Secondly, we propose a novel firearm detector based on the oriented object detection technique. Firearms, being thin (and in many cases elongated) and mostly held in oriented poses, are a perfect fit for the oriented object detection problem.

For this purpose, an Orientation Aware Object Detector (OAOD) architecture is designed that can detect tight oriented bounding boxes (OBB) while being trained in a weakly supervised fashion using only axis-aligned bounding boxes (AABB) and orientation information. OAOD is designed to be a multistage detector such that at the last stage the features become independent of the object’s orientation. Such a setup simplifies the task of classification and bounding box regression, improving its accuracy. To keep the number of anchor boxes small at the RPN level, orientation is not associated with the region proposals. Instead, an orientation prediction module is introduced that predicts the orientation from every axis-aligned proposal classified as a firearm. The predicted orientation is used in an oriented region proposal generation step that allows sampling of features around the region aligned with the orientation of the object inside the AABB predicted in the last stage. We train OAOD to detect OBB around the firearms and classify them into two broad classes, guns and rifles. The experimental results (mAP: 88.3 on AABB and mAP: 77.5 on OBB) demonstrate the effectiveness and stability of the proposed method compared with the existing state-of-the-art methods.

Acknowledgment:

We greatly appreciate the assistance from Muhammad Faisal and Anza Shakeel in collecting and annotating the dataset, and Jason Chi for discussions and providing useful comments.

Bibliography

[1] How many school shootings in 2018 so far?, https://www.theguardian.com/world/2018/feb/14/school-shootings-in-america-2018-how-many-so-far , [Accessed: 2020-05-25].
[2] Mass shootings gun violence, https://www.theguardian.com/us-news/ng-interactive/2017/oct/02/america-mass-shootings-gun-violence , [Accessed: 2020-05-25].
[3] Santa fe shooting, https://time.com/5282496/santa-fe-high-school-shooting-2018/ , [Accessed: 2020-05-25].
[4] C. J. Howard, T. Troscianko, I. D. Gilchrist, A. Behera, D. C. Hogg, Suspiciousness perception in dynamic scenes: a comparison of cctv operators and novices, Frontiers in human neuroscience 7 (2013) 441.
[5] J. Ding, N. Xue, Y. Long, G.-S. Xia, Q. Lu, Learning roi transformer for oriented object detection in aerial images, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2849–2858.
[6] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, in: Advances in neural information processing systems, 2015, pp. 91–99.
[7] G. van Voorthuijsen, H. van Hoof, M. Klima, K. Roubik, M. Bernas, P. Pata, Cctv effectiveness study, in: Proceedings 39th Annual 2005 International Carnahan Conference on Security Technology, IEEE, 2005, pp. 105–108.
[8] N. Sulman, T. Sanocki, D. Goldgof, R. Kasturi, How effective is human video surveillance performance?, in: 2008 19th International Conference on Pattern Recognition, IEEE, 2008, pp. 1–3.