GLOW: Global Layout Aware Attacks on Object Detection

Buyu Liu; BaoJun; Jianping Fan; Xi Peng; Kui Ren; Jun Yu

arXiv:2302.14166·cs.CV·March 9, 2023

TL;DR

GLOW introduces a novel global layout-aware adversarial attack method for object detection that explicitly considers scene context and layout constraints, improving attack success rates across various scenarios.

Contribution

It is the first approach to generate layout-aware adversarial attacks with explicit semantic and geometric constraints for object detection.

Findings

01 Achieves about 30% improvement over state-of-the-art in single object attacks.

02 Outperforms SOTA by 20% on generic attack requests.

03 Excels in zero-query black-box attack scenarios with 20% better performance.

Tables

Table 1: Experimental setup. Our victim model is trained on coco17train only and the victim images are from coco17val and Pascal. The former consists of 3792 images with 80 categories while the latter has 500 images of 20 categories.

Victim $f$	White-box	$𝔽$	$𝔽$ + $𝕐$
Victim $f$	Black-box	$𝔽 \to 𝔻$	$𝔽$ + $𝕐 \to 𝕋$
Victim set $𝒟$		coco17val (3792/80)	Pascal (500/20)
Training set of victim model $f$ and $𝒯$			coco17train

Table 2: Overall performance of R1. As described in Sec. 3.1, we have three different target labels, R1-5, R1-50 and R1-95, for each victim object and we report the results on all of them. We highlight the best and second best with bold and underline respectively.

Methods	White-box (coco17val/Pascal)						Zero query black-box (coco17val/Pascal)
	R1-5		R1-50		R1-95		R1-5		R1-50		R1-95
	F	F+R	F	F+R	F	F+R	F	F+R	F	F+R	F	F+R
TOG [13]	.64/.67	.11/.16	.75/.77	.15/.22	.87/.82	.20/.27	.08/.13	.01/.04	.16/.19	.02/.08	.23/.27	.03/.13
TOG+RAND	.45/.48	.06/.08	.54/.61	.08/.14	.58/.66	.07/.14	.12/.10	.01/.02	.21/.17	.03/.04	.27/.26	.04/.06
TOG+SAME	.89/.52	.18/.13	.90/.68	.18/.22	.91/.75	.18/.22	.21/.10	.01/.04	.34/.22	.03/.11	.38/.31	.03/.11
Cai [7]	.86/.46	.09/.08	.87/.63	.09/.09	.90/.74	.07/.11	.18/.08	.01/.03	.29/.19	.02/.08	.34/.30	.02/.11
GLOW	.85/.61	.20/.20	.87/.76	.22/.28	.89/.79	.21/.29	.21/.11	.02/.05	.30/.22	.03/.12	.35/.33	.04/.18

Table 3: Overall performance of R2. Similar to R1, we have three different target labels for each victim image. Since the victim object location is not provided in R2, T, F+T and E+R reflect different aspects of layout consistency.

Methods	White-box (coco17val / Pascal)
	R2-5			R2-50			R2-95
	T	F+T	E+R	T	F+T	E+R	T	F+T	E+R
TOG [13]	.18 / .18	.31 / .38	.17 / .14	.20 / .20	.41 / .44	.24 / .21	.22 / .20	.49 / .34	.25 / .15
TOG+RAND	.19 / .22	.22 / .29	.04 / .09	.18 / .18	.27 / .35	.06 / .23	.22 / .20	.32 / .34	.06 / .15
TOG+SAME	.18 / .23	.45 / .35	.21 / .22	.20 / .22	.51 / .43	.20 / .27	.23 / .19	.55 / .38	.20 / .26
Cai [7]	.21 / .24	.44 / .30	.11 / .10	.20 / .22	.48 / .38	.11 / .15	.24 / .19	.53 / .37	.09 / .16
GLOW	.38 / .35	.64 / .50	.32 / .24	.40 / .35	.67 / .55	.35 / .29	.44 / .33	.69 / .48	.32 / .29
	Zero query black-box (coco17val / Pascal)
TOG [13]	.25 / .26	.04 / .08	.01 / .04	.24 / .25	.08 / .15	.02 / .12	.28 / .21	.12 / .12	.03 / .15
TOG+RAND	.20 / .28	.05 / .08	.01 / .03	.20 / .20	.08 / .12	.01 / .05	.29 / .25	.14 / .15	.03 / .07
TOG+SAME	.23 / .30	.12 / .10	.02 / .06	.22 / .25	.20 / .19	.03 / .18	.27 / .26	.25 / .17	.05 / .16
Cai [7]	.26 / .37	.10 / .09	.02 / .07	.25 / .26	.15 / .12	.02 / .13	.30 / .21	.20 / .14	.02 / .18
GLOW	.37 / .38	.17 / .14	.03 / .10	.38 / .35	.22 / .18	.04 / .15	.44 / .38	.29 / .23	.05 / .19

Table 4: Overall performance of R3. Compared to R2, our C+R accounts for both layout consistency and amount restriction.

Methods	White-box (coco17val / Pascal)
	R3-5			R3-50			R3-95
	T	F+T+C	C+R	T	F+T+C	C+R	T	F+T+C	C+R
TOG [13]	.18 / .21	.35 / .28	.10 / .05	.19 / .15	.43 / .37	.13 / .16	.22 / .19	.50 / .35	.14 / .17
TOG+RAND	.18 / .30	.27 / .24	.08 / .05	.19 / .20	.33 / .32	.11 / .16	.22 / .19	.38 / .31	.12 / .16
TOG+SAME	.18 / .23	.14 / .16	.11 / .06	.20 / .21	.15 / .18	.12 / .14	.22 / .19	.17 / .16	.12 / .17
Cai [7]	.20 / .25	.17 / .10	.02 / .02	.21 / .24	.22 / .16	.02 / .06	.23 / .22	.21 / .13	.02 / .04
GLOW	.32 / .28	.48 / .27	.13 / .07	.34 / .28	.52 / .35	.15 / .12	.35 / .27	.53 / .37	.14 / .11
	Zero query black-box (coco17val / Pascal)
TOG [13]	.21 / .32	.01 / .01	.00 / .01	.21 / .23	.03 / .03	.01 / .03	.28 / .26	.04 / .04	.02 / .05
TOG+RAND	.22 / .31	.01 / .01	.00 / .01	.21 / .21	.02 / .03	.01 / .03	.28 / .25	.04 / .04	.01 / .04
TOG+SAME	.21 / .32	.01 / .01	.00 / .01	.22 / .23	.02 / .01	.01 / .04	.27 / .25	.03 / .03	.02 / .05
Cai [7]	.25 / .35	.01 / .01	.00 / .01	.25 / .31	.01 / .00	.01 / .01	.30 / .27	.02 / .02	.00 / .03
GLOW	.31 / .35	.02 / .01	.01 / .03	.34 / .38	.04 / .02	.01 / .02	.37 / .36	.04 / .04	.01 / .03

Equations

$$v_{d}(c_{p})=\frac{1}{N}\sum_{n}v(c_{p},s_{n});\quad\forall c_{p}\notin\mathcal{S}_{n}$$

$$w_{p}(x)=\frac{1}{Q}\sum_{q}\pi_{q}^{p}\times pdf_{q}^{p}(x)$$

$$s_{1}(I_{t})=\frac{1}{X}\sum_{l_{p}^{*}\in\mathcal{O}_{*}}s(l_{p}^{*});\quad\forall I_{t}\in\mathcal{T}^{*}$$

$$\delta_{t}^{*}=\arg\max_{\delta\in\mathbb{S}_{M}}\sum_{l_{n}\notin\mathcal{O}_{*}}\mathcal{L}_{m}(l_{n},l^{t}_{\delta(n)})=\arg\max_{\delta\in\mathbb{S}_{M}}\sum_{l_{n}\notin\mathcal{O}_{*}}L1(l_{n},l^{t}_{\delta(n)})+GIoU(l_{n},l^{t}_{\delta(n)})$$

Full text

GLOW: Global Layout Aware Attacks on Object Detection

Jun Bao*

Buyu Liu*

Jianping Fan

Xi Peng

Kui Ren

Jun Yu

Abstract

Adversarial attacks aim to perturb images such that a predictor outputs incorrect results. Due to the limited research in structured attacks, imposing consistency checks on natural multi-object scenes is a promising yet practical defense against conventional adversarial attacks. More desirable attacks, to this end, should be able to fool defenses with such consistency checks. Therefore, we present GLOW, the first approach that copes with various attack requests by generating global layout-aware adversarial attacks, in which both categorical and geometric layout constraints are explicitly established. Specifically, we focus on the object detection task: given a victim image, GLOW first localizes victim objects according to target labels. It then generates multiple attack plans, together with their context-consistency scores. On the one hand, GLOW is capable of handling various types of requests, including single or multiple victim objects, with or without specified victim objects. On the other hand, it produces a consistency score for each attack plan, reflecting an overall contextual consistency in which both semantic category and global scene layout are considered. In experiments, we design multiple types of attack requests and validate our ideas on MS COCO and Pascal. Extensive experimental results demonstrate that we achieve about 30$\%$ average relative improvement over state-of-the-art methods on the conventional single-object attack request. Moreover, our method outperforms SOTAs significantly on more generic attack requests, by about 20$\%$ on average. Finally, our method produces superior performance under the challenging zero-query black-box setting, about 20$\%$ better than SOTAs. Our code, model and attack requests will be made available.

1 Introduction

Object detection aims to localize and recognise multiple objects in given images with their 2D bounding boxes and corresponding semantic categories [14, 19]. Due to physical commonsense and viewpoint preferences [16], detected bounding boxes in natural images are not only semantically labeled but also placed relative to each other within a coherent scene geometry, reflecting the underlying 3D scene structure. Such a bounding box representation allows us to derive a notion of both semantic and geometric constraints. For example, the co-occurrence matrix is a commonly exploited semantic constraint, where certain object categories are more likely to co-occur, e.g., bed and pillow [20]. Geometric constraints, on the other hand, leverage the inductive bias of scene layout [11]: when co-occurring in a scene, a traffic light is more likely to appear in the upper region, with a smaller bounding box, than a car.

Adversarial attacks on object detectors mainly focus on the targeted victim setting [47, 7], where the goal is to perturb a specific victim object to a target class. In this case, the location, ground truth and target class of the victim object are assumed to be known to attackers. Naturally, contextual cues are leveraged in attack and defense mechanisms [50, 7, 5] on detectors to enhance or detect holistic context (in)consistency [5]. Though well-motivated and demonstrating good performance in the conventional setting, the state-of-the-art methods [7, 5] suffer from the following problems in practice. Firstly, the assumption of known location and ground-truth label of the victim object might be too strong due to annotation cost [2]. Therefore, vaguer attack requests where victim objects are not specified, e.g. show me an apple and a chair, should be considered in practice, and these are beyond the existing methods. Secondly, global geometric layout is commonly neglected, as existing methods either model semantic co-occurrence [5] or consider relative sizes and distances w.r.t. the given victim object [7].

In this work, we introduce GLOW, a novel yet generic attack plan generation algorithm for both conventional and generic attack requests that effectively leverages both categorical and global layout cues, leading to superior white-box attack performance and better transferability in the black-box setting. As for generic requests, we first relax the assumption of a known specific victim object by requesting only the existence of a certain target label, e.g. show me category X in the image. Compared to the conventional setting, this request demands modelling the locations and sizes of target label X. Our second request further constrains the label amount, e.g. give me N objects of category X and M objects of category Y, which necessitates the global layout of the victim image. To fulfill these requests, GLOW accounts for both categorical and geometric relations. Specifically, GLOW aims to figure out the most context-consistent attack plan for each victim image according to its underlying layout while respecting the hard constraints, e.g. the existence or amount of some target labels under generic requests, or a specific victim object under the conventional request. The first step in GLOW localizes victim objects with the given target label or amount on the victim image by modeling the joint distribution of bounding box sizes and centers, which enables generic attack requests. Given these victim objects, the second step further leverages the layouts of victim images to generate globally context-consistent attack plans with consistency scores. This is achieved by reformulating the generation as a layout similarity measurement problem, so the consistency scores are similarity scores. Finally, the plan with the highest score is our selected attack plan. We then implement the selected plan with existing attack generation methods, or attackers. Details of our proposed requests and GLOW can be found in Fig. 1.

We validate our ideas on coco2017val [33] as well as Pascal [18] under both white-box and zero-query black-box settings. We also design new evaluation metrics that measure layout consistency, thus mimicking consistency defenses. We demonstrate that, in the white-box setting, our proposed method achieves superior performance under both conventional and proposed generic attack requests compared to SOTAs. More importantly, GLOW provides significantly better transfer success rates in the zero-query black-box setting compared to existing methods.

Our contributions can be summarized as follows:

• A novel method GLOW that is capable of generating context-consistent attack plans while accounting for both categorical and geometric layout coherency.

• Two generic attack requests on coco2017val and Pascal images and one consistency evaluation metric to mimic realistic attack requests and delicate attack defenses.

• State-of-the-art performance on coco2017val and Pascal images under both white-box and zero-query black-box settings. Code, model and requests will be available.

2 Related Work

Object detection The goal of object detection is to predict a set of bounding boxes and category labels for each object of interest. Starting from [14, 19], object detection has explored extensive cues, including semantic [27], geometric [48] and other contextual cues [49], to improve its performance as well as interpretability. Recently, deep neural networks (DNNs) [26] have significantly improved many computer vision [26, 21] and natural language processing tasks [23, 15]. Modern detectors follow the neural network design, such as two-stage models, where proposals are first generated and then regression and classification are performed [21, 6], and one-stage models [32, 54, 45] that simultaneously predict multiple bounding boxes and class probabilities with the help of pre-defined anchors or object centers. More recently, transformer-based models [8, 55] have been proposed to further simplify the detection process by formulating object detection as a set prediction problem, where unique predictions are achieved by bipartite matching rather than non-maximum suppression [24, 4]. Similarly, contextual cues are also explored in modern detectors [3, 53, 12, 36, 1] in various forms. In this paper, we focus on adversarial attacks on DNN-based detectors. GLOW generates contextually coherent attack plans for various requests, which are also transferable to detectors of different architectures.

Adversarial attacks and defenses in object detection Despite impressive performance boosts, DNNs are vulnerable to adversarial attacks, e.g. adding visually imperceptible perturbations to images leads to a significant performance drop [22, 44, 9]. Adversarial attacks can be categorized into white-box [22, 37] and black-box [17, 31], depending on whether the parameters of victim models are accessible or not. Attacks such as DAG [47], RAP [30] and CAP [52] are architecture-specific white-box attacks on detectors that require a two-stage architecture, since they work on proposals generated by the first stage. More generic attacks, such as UAE [46] and TOG [13], are capable of attacking all kinds of models regardless of their architectures. Compared to the aforementioned methods that perturb the image globally [47], patch-based attacks [34] also showcase their ability to fool detectors without touching the victim objects [25]. In contrast, black-box attacks [29, 10, 38, 35, 28] are more practical yet challenging, where either a few queries or known surrogate models are exploited to fool an unknown victim model. Observing the impact of adversarial attacks on detectors, various defense methods have been proposed to detect such attacks, wherein contextual cues are explored [50]. However, contextual cues are almost always represented in the form of a semantic co-occurrence matrix, where global layouts are largely neglected [7, 5]. In contrast, we propose a generic attack plan generation algorithm that leverages both semantic and geometric coherency, e.g. scene layout. Consequently, it handles both the conventional single targeted victim setting and generic attack requests where locations are unknown or the object amount is further restricted, translating to SOTA performance under white-box and black-box settings.

3 Method

We introduce the attack requests, proposed GLOW and attacker in Sec. 3.1, Sec. 3.2 and Sec. 3.3 respectively.

3.1 Attack requests

To attack a victim image, a user may or may not specify victim objects, e.g. by providing their locations or labels. Therefore, besides the conventional attack request where a specific object and its targeted label are given, more generic requests should also be addressed, such as give me 2 cats or mis-classify the rightmost boat to car. Let $\mathcal{D}$ denote the set of victim images $I$, and let $\mathcal{C}=\{c_{p}\}_{p=1}^{C}$ be the label space with $C$ semantic categories. Given a known object detector $f$, which can be the victim model in a white-box attack or the surrogate model under the black-box setting, we obtain a set of predicted objects $\mathcal{O}$ on victim image $I$, i.e. $\mathcal{O}=f(I)=\{l_{n},s_{n}\}_{n=1}^{N}$, consisting of the locations and semantic categories of $N$ objects. $l_{n}$ defines the location of the $n$-th object, including its bounding box center coordinates, height and width, and $s_{n}\in\mathcal{C}$ is its semantic label.

R1: mis-classify the object $s_{n}$ to $c_{p}$ . This is the conventional attack request where the $n$ -th object is our specific victim object and $c_{p}$ is the targeted label.

Though one can always choose a random object as the victim and a random category as $c_{p}$, we observe that the choice of victim object and target label plays an important role in attack performance (see Sec. 4). To this end, we set different selection criteria for the victim object and the targeted label to evaluate attack methods from various aspects. As for the victim object, it is impractical to assume that ground-truth locations can be provided by users, since bounding box annotation can be time-consuming [2]. Therefore, we turn to the predictions as reliable sources to determine where the attack should take place. More specifically, among all predictions with confidence score above 0.85, the one with the largest bounding box is selected, which gives a good estimate in practice.
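The victim selection rule above can be sketched as follows. This is a minimal illustration, not the authors' code: the `Detection` container, the `(cx, cy, w, h)` box layout and the function name `select_victim` are assumptions made for the example; only the 0.85 confidence threshold and the largest-box criterion come from the text.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical container for one predicted object (l_n, s_n) plus its score."""
    cx: float
    cy: float
    w: float
    h: float
    label: str
    score: float


def select_victim(detections, min_score=0.85):
    """Return the index of the largest-area detection with score above min_score.

    Returns None when no detection is confident enough.
    """
    best_idx, best_area = None, -1.0
    for idx, det in enumerate(detections):
        if det.score <= min_score:
            continue  # skip low-confidence predictions
        area = det.w * det.h
        if area > best_area:
            best_idx, best_area = idx, area
    return best_idx


detections = [
    Detection(0.5, 0.5, 0.8, 0.6, "bed", 0.70),     # largest box, but low confidence
    Detection(0.3, 0.4, 0.2, 0.3, "pillow", 0.95),
    Detection(0.7, 0.6, 0.4, 0.5, "chair", 0.90),
]
victim_index = select_victim(detections)
```

Here the bed is skipped because its confidence is below the threshold, so the chair becomes the victim even though it is not the largest box overall.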

As for the target label $c_{p}$, we mainly follow [7, 5], where the out-of-context attack is considered. Specifically, to eliminate the chance of miscounting existing objects as successes, $c_{p}$ is selected if and only if $c_{p}$ is not present in $I$. Rather than randomly selecting $c_{p}$ among all absent categories [7, 5], our decision is made according to distance in word vector space [51], as it captures the semantic and syntactic qualities of words. Mathematically, for each absent $c_{p}$, its averaged distance is:

$$v_{d}(c_{p})=\frac{1}{N}\sum_{n}v(c_{p},s_{n});\quad\forall c_{p}\notin\mathcal{S}_{n}$$

where $\mathcal{S}_{n}=\{s_{n}\}_{n=1}^{N}$ and $v(c_{p},s_{n})$ denotes the cosine distance between category $c_{p}$ and $s_{n}$ in word vector space.

To evaluate the impact of the target label $c_{p}$, we collect three $c_{p}$s according to $v_{d}(c_{p})$ and visualize them in Fig. 2. Specifically, we first rank all $c_{p}\notin\mathcal{S}_{n}$ based on $v_{d}(c_{p})$. Then we choose the top $5\%$, $50\%$ and $95\%$ ones as our target classes, referred to as R1-95, R1-50 and R1-5, respectively.
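The ranking of absent categories by averaged word-vector distance could look roughly like the sketch below. The three-dimensional embeddings are made up for illustration (real word vectors [51] are much higher-dimensional), and the ascending sort order and the helper `pick_at_fraction` are assumptions, since the text does not state the ranking direction.

```python
import math


def cosine_distance(u, v):
    """1 minus cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def averaged_distance(candidate, present_labels, embeddings):
    """v_d(c_p): mean cosine distance from c_p to the labels s_n of all N objects."""
    total = sum(cosine_distance(embeddings[candidate], embeddings[s]) for s in present_labels)
    return total / len(present_labels)


def rank_absent_labels(all_labels, present_labels, embeddings):
    """Sort categories not present in the image by v_d (ascending: an assumption)."""
    present = set(present_labels)
    absent = [c for c in all_labels if c not in present]
    return sorted(absent, key=lambda c: averaged_distance(c, present_labels, embeddings))


def pick_at_fraction(ranked, fraction):
    """Pick the entry located at the given fraction of the ranked list."""
    return ranked[min(len(ranked) - 1, int(fraction * len(ranked)))]


# Toy, hand-made embeddings (purely illustrative).
embeddings = {
    "bed": (1.0, 0.0, 0.0),
    "pillow": (0.9, 0.1, 0.0),
    "couch": (0.8, 0.2, 0.1),
    "car": (0.0, 1.0, 0.0),
    "boat": (0.0, 0.8, 0.6),
}
ranked = rank_absent_labels(list(embeddings), ["bed", "pillow"], embeddings)
targets = [pick_at_fraction(ranked, f) for f in (0.05, 0.50, 0.95)]
```

With these toy vectors, the couch is the closest absent category to a bedroom scene, while car and boat sit far from it.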

Our ultimate goal in R1 is not only to mis-classify the victim object, but also to defeat the potential defense of consistency checks. Therefore, the challenge of R1 mainly lies in figuring out an attack plan that is contextually consistent and beneficial for the mis-classification in practice.

R2: show me the category $c_{p}$. Rather than assuming that a specific victim object is known to attackers as in R1, R2 goes one step further in relaxing the attack request. Specifically, R2 comes in a much vaguer form, where the user only specifies the target label $c_{p}$.

Though asking for the existence of $c_{p}$ seems easier than R1, since one can always flip a random object to $c_{p}$, we argue that this conclusion holds only if a coarse semantic consistency check/defense, e.g. the co-occurrence matrix [5], is the sole defense available, which unfortunately neglects geometric context. A more desirable consistency check should capture both geometric and semantic context; for example, a traffic light is less likely to appear at the bottom of the image, while poles usually have slim bounding boxes. Our goal is to fool the victim model and such a delicate defense simultaneously.

Therefore, we claim that R2 is more challenging than R1, as it requires additional understanding of the location-wise distribution of target label $c_{p}$. We note that this challenge is beyond [5, 7] (see Fig. 3). We omit the details of $c_{p}$ in R2-5, R2-50 and R2-95, as they are selected based on the same criteria as in R1 (see supplementary).

R3: give me multiple $c_{p}$s. R3 reflects another realistic attack scenario, e.g. have a monitor and a mouse in victim image $I$. Besides not specifying the victim object by providing only target label information, R3 enforces an additional constraint on object amount, making it more challenging. Specifically, multi-object relationships should be considered together with hard restrictions on the number of objects (see Fig. 4). For example, besides modelling the locations of mouse and monitor individually, estimating their layout, e.g. the monitor is more likely to be above the mouse, is also essential to achieve context-consistent yet fooled predictions.

Theoretically, R3 can involve multiple victim objects of the same or different categories, which does not affect the GLOW method described below. In practice, when it comes to objects of various categories, additional heuristics are needed to avoid semantic inconsistency, as $v_{d}$ does not guarantee contextually consistent combinations. Moreover, this problem becomes more severe with an increasing number of objects, together with the emergence of a new challenge: underlying constraints on object amount in natural images. For instance, ten apples in $I$ can be natural, but ten stop signs are not. Therefore, we leave objects of different categories as future work and focus on two objects of the same $c_{p}$. Details of R3-5, R3-50 and R3-95 can be found in the supplementary.

3.2 GLOW: Global LayOut aWare attacks

Contextually consistent attacks have been discussed in many previous works [7, 5]. The main motivation is that perturbing only the victim object may lead to inconsistency in context, thus a global attack plan should be considered. Specifically, an attack plan assigns target labels to all objects in the victim image, including ones that are not victims originally, to both avoid inconsistency and benefit the attack request. Though well-motivated, existing methods largely rely on semantic context [5], neglecting geometric context such as scene layout. In addition, the ability to model prior knowledge, such as the fact that more than ten beds in an image is unlikely while ten books are more plausible, is lacking in the literature.

To this end, we propose a novel attack plan generation method GLOW that accounts for both semantic and geometric context, such as object locations and overall scene layout. GLOW consists of two steps. The first, localization, step aims to locate the victim object based on target labels and their amounts under generic attack requests R2 and R3. The second, generation, step then produces multiple context-consistent attack plans, as well as their scores, given the victim objects. Afterwards, the plan with the highest score is selected as our final attack plan and passed to existing attackers. See Fig. 5 for more details.

Victim object localization We aim to localize victim objects under R2 and R3, where constraints on target labels and/or their amount are available.

Let us first assume there exist some images from an annotated detection dataset, which, in the simplest case, can be the training set that our victim/surrogate model is trained on. We denote this dataset as $\mathcal{T}$, including $T$ images and their bounding box annotations $\mathcal{A}=\{\mathcal{A}_{t}\}_{t=1}^{T}$, where $\mathcal{A}_{t}=\{l^{t}_{m},s^{t}_{m}\}_{m=1}^{M}$ is the set of bounding box annotations on the $t$-th image $I_{t}$. Similarly, we assume there exist $M$ objects, and the $m$-th object instance has location $l^{t}_{m}$ and semantic category $s^{t}_{m}\in\mathcal{C}$.

Determining the location of the victim object under R2 and R3 is equivalent to estimating the center, height and width of bounding boxes of target label $c_{p}$. We formulate the localization as a probability maximization problem, achieved by modelling the joint probability of bounding box center, height and width per category. Specifically, for each $c_{p}\in\mathcal{C}$, we have $\mathcal{L}_{c_{p}}=\{l^{t}_{m}|s^{t}_{m}=c_{p}\}_{t,m}$, where $l^{t}_{m}$ is normalized by image height and width. Then we apply GMM [39] to fit $Q$ Gaussians $\mathcal{N}^{p}_{q}(\mu^{p}_{q},\delta^{p}_{q})$, $q\in\{1,\dots,Q\}$, on $\mathcal{L}_{c_{p}}$, where $\mu^{p}_{q}$ and $\delta^{p}_{q}$ are the mean and covariance of the $q$-th Gaussian for class $c_{p}$. $pdf_{q}^{p}$ and $\pi^{p}_{q}$ are the probability density function and the weight of $\mathcal{N}^{p}_{q}$, respectively. $Q$ is set to 5 based on experiments on $\mathcal{T}$. Given any $x\in\mathbb{R}^{4}$, our GMM provides a weighted probability density $w_{p}(x)$ by:

$$w_{p}(x)=\frac{1}{Q}\sum_{q}\pi_{q}^{p}\times pdf_{q}^{p}(x)$$

Simply going through all $x$ and choosing the ones with the highest $w_{p}(x)$ ignores the overall scene layout and might result in significant layout changes, e.g. a large bounding box on an objectless area or heavy occlusions, leading to less plausible overall layouts. Instead, we narrow down our search space to existing bounding boxes and find the optimal location among all $l_{n}$. For R2, the victim object is found by $n^{*}=\arg\max_{n}w_{p}(l_{n})$. For R3, we rank and select the top ones depending on the detailed request, e.g. choosing the top 2 if R3 asks for two objects of the same target label $c_{p}$. We then denote the victim objects in $I$ as $\mathcal{O}_{*}=\{l_{p}^{*},c_{p}^{*}\}_{*}$, where $c_{p}^{*}$ equals $c_{p}$ in R1 and R2, and $\{c_{p}^{*}\}_{*}$ is the set of requested target labels in R3. Similarly, $l_{p}^{*}$ is $l_{n}$ in R1 and comes from estimation in R2 ($l_{n^{*}}$) and R3. We further denote the number of target objects as $X$, with $X=1$ under R1 and R2. Example victim objects $\mathcal{O}_{*}$ can be found in Fig. 6.
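As a rough illustration of this localization step, the sketch below evaluates $w_{p}$ on the existing boxes and returns the best-scoring indices. It assumes the mixture has already been fitted (for instance with a library such as scikit-learn's `GaussianMixture`) and, to stay dependency-free, uses diagonal covariances and hand-picked parameters; both are simplifying assumptions, as is the `(cx, cy, w, h)` ordering of $x$.

```python
import math


def diag_gaussian_pdf(x, mean, var):
    """Density of a Gaussian with diagonal covariance (a simplifying assumption)."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * math.log(2.0 * math.pi * vi) - 0.5 * (xi - mi) ** 2 / vi
    return math.exp(log_p)


def weighted_density(x, components):
    """w_p(x) = (1/Q) * sum_q pi_q * pdf_q(x) for one category's mixture."""
    q = len(components)
    return sum(pi * diag_gaussian_pdf(x, mean, var) for pi, mean, var in components) / q


def localize_victims(boxes, components, count=1):
    """Indices of the `count` existing boxes with the highest w_p, best first."""
    order = sorted(
        range(len(boxes)),
        key=lambda n: weighted_density(boxes[n], components),
        reverse=True,
    )
    return order[:count]


# Hypothetical two-component mixture for "traffic light": (weight, mean, variance),
# over normalized (cx, cy, w, h). Small boxes in the upper part of the image score high.
traffic_light = [
    (0.7, (0.5, 0.20, 0.05, 0.10), (0.05, 0.01, 0.001, 0.002)),
    (0.3, (0.3, 0.25, 0.04, 0.08), (0.05, 0.01, 0.001, 0.002)),
]
boxes = [
    (0.50, 0.80, 0.40, 0.30),  # large box near the bottom
    (0.52, 0.22, 0.06, 0.11),  # small box near the top
    (0.90, 0.50, 0.20, 0.20),
]
r2_victim = localize_victims(boxes, traffic_light)            # R2: single best box
r3_victims = localize_victims(boxes, traffic_light, count=2)  # R3: top two boxes
```

Restricting the search to existing boxes, as the text describes, is what keeps the chosen location compatible with the current layout.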

Global attack plan generation Given victim objects $\mathcal{O}_{*}$, our next step is to generate target labels for objects that are not victims. Specifically, it aims to find a mapping function $g(s_{n})\in\mathcal{C}$ that perturbs the labels of these objects, resulting in $\hat{\mathcal{O}}=\{l_{n},g(s_{n})\}_{n}$. The overall generated attack plan on $I$ is then $\hat{\mathcal{O}}_{*}=\{\mathcal{O}_{*},\hat{\mathcal{O}}\}$.

Theoretically, there exist $C^{N-X}$ possible configurations in $\hat{\mathcal{O}}$, since each of the $N-X$ non-victim objects can take any of the $C$ labels. Instead of enumerating all possible solutions, we restrict ourselves to feasible ones that occur in the existing dataset $\mathcal{T}$, as scene layouts are naturally context-consistent therein. To this end, we formulate our global attack plan generation as a layout similarity measurement problem, with a hard constraint on victim objects $\mathcal{O}_{*}$. Our goal is therefore to map the bounding box labels according to the best match based on layout similarity in $\mathcal{T}$. Intuitively, the more similar these layouts are, the more confident we are in performing the mapping. Therefore, the layout similarity score reflects context consistency to some extent. Our insights lie in the following design choices for obtaining the mapping function $g$ and score $s$:

•

Generate $\mathcal{T}^{*}=\{\mathcal{T}_{c_{p}^{*}}\}_{c_{p}^{*}}$ where $\mathcal{T}_{c_{p}^{*}}$ consists of images that have target label $c_{p}^{*}$ presented.

•

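Compute the Intersection over Union (IoU) score between victim objects $\mathcal{O}_{*}$ and objects that share the same target labels in $I_{t}\in\mathcal{T}_{c_{p}^{*}}$. Mathematically:

$$s_{1}(I_{t})=\frac{1}{X}\sum_{l_{p}^{*}\in\mathcal{O}_{*}}s(l_{p}^{*});\quad\forall I_{t}\in\mathcal{T}^{*}$$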

where $s(l_{p}^{*})=\max_{m}\mathbbm{1}_{\{c_{p}^{*}=s^{t}_{m}\}}IoU(l_{p}^{*},l^{t}_{m})$ . The IoU score between victim object location $l_{p}^{*}$ and $m$ -th bounding box in $I_{t}$ is obtained by $IoU(l_{p}^{*},l^{t}_{m})$ .

•

Perform Hungarian matching [43, 8] between objects in $I_{t}$ and those in victim image $I$ . Specifically, we find a bipartite matching between these two sets by searching for a permutation of $M$ elements $\mathbb{S}_{M}$ with the lowest cost:

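$$\delta_{t}^{*}=\arg\max_{\delta\in\mathbb{S}_{M}}\sum_{l_{n}\notin\mathcal{O}_{*}}\mathcal{L}_{m}(l_{n},l^{t}_{\delta(n)})=\arg\max_{\delta\in\mathbb{S}_{M}}\sum_{l_{n}\notin\mathcal{O}_{*}}L1(l_{n},l^{t}_{\delta(n)})+GIoU(l_{n},l^{t}_{\delta(n)})$$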

where $L1()$ and $GIoU()$ denote the L1 distance and GIoU [42] between bounding boxes. $\delta_{t}^{*}(n)$ is the index of the best match of the $n$-th object, which is not originally a victim in victim image $I$. The match loss of $\delta_{t}^{*}$ is obtained as $s_{2}(I_{t})=\frac{1}{N-X}\sum_{l_{n}\notin\mathcal{O}_{*}}\mathcal{L}_{m}(l_{n},l^{t}_{\delta_{t}^{*}(n)})$. The temporary mapping function based on the $t$-th image $I_{t}$ is then defined as $g_{t}(s_{n})=s^{t}_{\delta_{t}^{*}(n)}$.
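The three steps listed above can be sketched for a single reference image $I_{t}$ as follows. Several choices here are assumptions rather than the paper's implementation: boxes are `(cx, cy, w, h)` tuples, the matching cost is taken as L1 distance plus a GIoU loss term $(1-GIoU)$ and is minimized (following the "lowest cost" wording in the text), and an exhaustive search over assignments stands in for the Hungarian algorithm, which returns the same optimum on small inputs. In practice a solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice.

```python
from itertools import permutations


def corners(box):
    """Convert (cx, cy, w, h) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def iou_giou(a, b):
    """Return (IoU, GIoU) of two boxes with positive area."""
    ax1, ay1, ax2, ay2 = corners(a)
    bx1, by1, bx2, by2 = corners(b)
    inter = max(0.0, min(ax2, bx2) - max(ax1, bx1)) * max(0.0, min(ay2, by2) - max(ay1, by1))
    union = a[2] * a[3] + b[2] * b[3] - inter
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    iou = inter / union
    return iou, iou - (hull - union) / hull


def s1_score(victims, annotations):
    """Mean over victims of the best IoU with a same-label annotation in I_t."""
    total = 0.0
    for box, target in victims:
        ious = [iou_giou(box, ann_box)[0] for ann_box, label in annotations if label == target]
        total += max(ious, default=0.0)
    return total / len(victims)


def match_cost(a, b):
    """Assumed pairwise cost: L1 distance plus GIoU loss (1 - GIoU)."""
    l1 = sum(abs(x - y) for x, y in zip(a, b))
    return l1 + (1.0 - iou_giou(a, b)[1])


def match_layout(others, annotations):
    """Lowest-cost assignment of non-victim objects to annotations of I_t.

    Returns (assignment, s2, mapped_labels), or None if matching is impossible.
    Exhaustive search; only suitable for a handful of objects.
    """
    if not others or len(annotations) < len(others):
        return None
    best_cost, best_assign = float("inf"), None
    for assign in permutations(range(len(annotations)), len(others)):
        cost = sum(match_cost(others[n][0], annotations[m][0]) for n, m in enumerate(assign))
        if cost < best_cost:
            best_cost, best_assign = cost, assign
    mapped = [annotations[m][1] for m in best_assign]  # g_t(s_n) for each non-victim
    return best_assign, best_cost / len(others), mapped


victims = [((0.5, 0.20, 0.1, 0.2), "traffic light")]
others = [((0.3, 0.7, 0.3, 0.2), "car"), ((0.7, 0.7, 0.2, 0.2), "car")]
reference = [
    ((0.50, 0.22, 0.1, 0.2), "traffic light"),
    ((0.31, 0.70, 0.3, 0.2), "bus"),
    ((0.70, 0.72, 0.2, 0.2), "truck"),
]
s1 = s1_score(victims, reference)
assignment, s2, new_labels = match_layout(others, reference)
```

In this toy case the two cars inherit the labels of the reference boxes they overlap most closely, which is the role of the temporary mapping $g_{t}$.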

The overall similarity score between $I_{t}\in\mathcal{T}^{*}$ and $I$ is obtained by $s(I_{t})=s_{1}(I_{t})-\lambda s_{2}(I_{t})$, where $\lambda$ is a hyper-parameter chosen by experiment. Note that score $s$ accounts not only for the victim objects, reflected by $s_{1}$, but also for the overall layout similarity, incorporated in $s_{2}$.

Afterwards, we select the image $I_{t^{*}}$ that 1) gives the highest similarity score and 2) matches more than 95$\%$ of the objects in $I$. Consequently, the mapping function $g(s_{n})$ equals the temporary mapping function of the $t^{*}$-th image, i.e. $g_{t^{*}}(s_{n})$. We refer the readers to Fig. 6 for more details.
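Selecting the final plan from per-image scores might be sketched as below. The candidate records, their field names and the value of `lam` are hypothetical (the text only says $\lambda$ is chosen by experiment), and the sketch reads the two conditions as "highest score among reference images that match more than 95% of the objects".

```python
def select_plan(candidates, lam=1.0, min_matched=0.95):
    """Return (score, candidate) with the highest s = s1 - lam * s2.

    Only candidates whose matched fraction exceeds min_matched are considered.
    Returns None when no candidate qualifies.
    """
    best = None
    for cand in candidates:
        if cand["matched_fraction"] <= min_matched:
            continue  # too few objects of the victim image were matched
        score = cand["s1"] - lam * cand["s2"]
        if best is None or score > best[0]:
            best = (score, cand)
    return best


# Hypothetical per-reference-image results; "mapping" holds g_t for non-victim objects.
candidates = [
    {"name": "A", "s1": 0.8, "s2": 0.3, "matched_fraction": 1.0, "mapping": ["bus", "truck"]},
    {"name": "B", "s1": 0.9, "s2": 0.1, "matched_fraction": 0.5, "mapping": ["bus"]},
    {"name": "C", "s1": 0.6, "s2": 0.2, "matched_fraction": 1.0, "mapping": ["car", "truck"]},
]
best_score, best_plan = select_plan(candidates)
```

Candidate B has the highest raw score but is rejected because it matches only half of the objects, so candidate A supplies the final mapping.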

3.3 Implementation of attack plan

To generate $\hat{\mathcal{O}}_{*}$, evasion attacks can be implemented using the victim model itself under the white-box setting, or a single or multiple surrogate model(s) under the zero-query black-box setting. In the white-box scenario, our implementation of the attack plan is based on TOG [13] for fair comparison with existing methods [7] (see Sec. 4). Specifically, we fix the weights of victim model $f$ and learn a perturbation image $\delta$ for $I$ by minimizing $\mathbf{L}(clip(I+\delta);\hat{\mathcal{O}}_{*})$ at every iteration [22]. $clip()$ is enforced to ensure bounded perturbation. Afterwards, the perturbed image $clip(I+\delta)$ is passed to another unknown victim model, mimicking the zero-query black-box setting.
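To make the iterative update concrete, here is a toy sketch of a bounded perturbation loop. It is not TOG itself: the detector loss $\mathbf{L}$ is replaced by a stand-in quadratic loss on three pixel values, and the signed-gradient step, the L-infinity budget in pixel units and all names (`attack`, `toy_grad`, `budget`, `step`) are assumptions chosen for illustration.

```python
def sign(value):
    """Return -1, 0 or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def clip_image(pixels, low=0.0, high=255.0):
    """Keep pixel values inside the valid image range."""
    return [min(high, max(low, p)) for p in pixels]


def attack(image, grad_fn, budget=10.0, step=1.0, iterations=20):
    """Iteratively lower a loss while keeping |delta| <= budget per pixel."""
    delta = [0.0] * len(image)
    for _ in range(iterations):
        adv = clip_image([p + d for p, d in zip(image, delta)])
        grad = grad_fn(adv)
        # Signed gradient descent step, then projection back into the budget.
        delta = [min(budget, max(-budget, d - step * sign(g))) for d, g in zip(delta, grad)]
    return clip_image([p + d for p, d in zip(image, delta)])


# Stand-in loss: squared distance to a pixel pattern that a real attack would not
# know in advance; it only plays the role of L(clip(I + delta); plan).
pattern = [120.0, 250.0, 0.0]


def toy_grad(adv):
    """Gradient of sum((adv - pattern)^2) with respect to adv."""
    return [2.0 * (a - t) for a, t in zip(adv, pattern)]


image = [100.0, 250.0, 5.0]
adversarial = attack(image, toy_grad, budget=10.0)
```

The first pixel stops at the budget boundary, the second needs no change, and the third reaches the pattern value, which shows how the budget and the image range both constrain the result.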

4 Experiment

To evaluate GLOW under various requests, we perform extensive experiments on coco2017val [33] and Pascal [18] under both white-box and black-box settings. As shown in Tab. 1, our victim model $f$ is either Faster-RCNN-R50-FPN-1X-COCO ($\mathbb{F}$) [41] or F-RCNN+YOLO ($\mathbb{F}$+$\mathbb{Y}$) [40] under the white-box setting. These models are then used as surrogate models in our black-box attacks, where DETR ($\mathbb{D}$) [8] and RetinaNet ($\mathbb{T}$) [32] are the victim models $f$. Our black-box attack is zero-query based, meaning that no feedback from the victim model is available. GLOW is generally applicable to different victim detectors; we choose the aforementioned models mainly for efficiency and reproducibility [7]. We report performance under perturbation budgets of 10 and 30. Due to space limitations, we refer the readers to the supplementary materials for results with the former and for discussions on limitations. Our claims hold under different perturbation budgets.

Baselines We compare GLOW with four baselines. For a fair comparison, all attack plan implementations are obtained with TOG [13]; we therefore describe only the attack plan generation process of each baseline:

• TOG [13]. The attack plan generated by TOG is context-agnostic, i.e., $g(s_{n})=s_{n}$. The victim object is given in R1 and is randomly selected under R2 and R3.

• TOG+RAND. TOG+RAND focuses on both victim objects and other objects. The victim object is provided in R1 and randomly selected under R2 and R3. The mapping function $g(s_{n})$ is a random permutation function $r$.

• TOG+SAME. The attack plan generated by TOG+SAME includes all objects, and we enforce $g(s_{n})=c_{p}$, meaning that all objects share the same target label $c_{p}$.

• Cai [7]. Cai [7] can be directly applied to R1. For R2 and R3, Cai [7] first selects random objects as victims and then generates the attack plan.
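The three TOG-based mapping functions can be sketched as below. This is an illustrative reading, not the authors' code: labels are plain strings, the random permutation $r$ is assumed to permute the labels already present in the image, and the victim object is assumed to always receive the target label $c_{p}$. All function names are hypothetical.

```python
import random

def tog_mapping(labels):
    # TOG: context-agnostic, g(s_n) = s_n.
    return list(labels)

def rand_mapping(labels, rng):
    # TOG+RAND: g is a random permutation r of the labels in the image
    # (assumed reading of "random permutation function").
    permuted = list(labels)
    rng.shuffle(permuted)
    return permuted

def same_mapping(labels, target_label):
    # TOG+SAME: g(s_n) = c_p for every object.
    return [target_label for _ in labels]

def attack_plan(labels, victim_index, target_label, mapping):
    # Map the context objects with g, then assign the target label
    # to the victim object.
    plan = mapping(labels)
    plan[victim_index] = target_label
    return plan

labels = ["person", "horse", "dog"]
rng = random.Random(0)
plan_tog = attack_plan(labels, 1, "zebra", tog_mapping)
plan_rand = attack_plan(labels, 1, "zebra", lambda ls: rand_mapping(ls, rng))
plan_same = attack_plan(labels, 1, "zebra", lambda ls: same_mapping(ls, "zebra"))
```

Under this reading, the baselines differ only in how the non-victim labels are mapped, which is what the layout-consistency metrics are designed to expose.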

Evaluation Metrics We follow the basic metric from [5] and introduce additional ones for generic attack requests. Fooling rate (F) [5] evaluates the attack performance on victim objects. Specifically, an attack succeeds if (1) the victim object is perturbed to the target label while its IoU with the ground truth (GT) is greater than 0.3 and (2) it passes the co-occurrence check. We define the fooling rate as the percentage of test cases for which both conditions are satisfied. We further introduce T to measure the consistency on victim objects. T itself reports the averaged $w_{p}(l_{p}^{*})$. When combined with other metrics, T is satisfied as long as the averaged $w_{p}(l_{p}^{*})$ is above 0.02 (see Sec. 3.2). To measure the overall layout consistency, we introduce R, which reflects the percentage of images whose maximum recall rate compared to $\mathcal{A}$ is above 0.5. We design two more metrics, E and C, to report the success rate on R2 and R3. E checks whether the target label $c_{p}$ exists in the predictions, while C further verifies the number of $c_{p}$ instances. An attack is successful if both the target labels and their counts satisfy the request in R3. We refer the readers to the supplementary material for more details on all metrics and give some visual examples in Fig. 7.
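The fooling rate F described above can be sketched as follows. The dictionary layout of a test case and the boolean `passes_cooccurrence` flag are assumptions made for illustration; how the co-occurrence check itself is computed is not reproduced here.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def attack_succeeds(case, iou_threshold=0.3):
    # Condition 1: target label predicted with IoU > 0.3 against the GT box.
    label_ok = case["pred_label"] == case["target_label"]
    box_ok = iou(case["pred_box"], case["gt_box"]) > iou_threshold
    # Condition 2: the prediction passes the co-occurrence check.
    return label_ok and box_ok and case["passes_cooccurrence"]

def fooling_rate(cases):
    """Percentage of test cases for which both conditions hold."""
    return 100.0 * sum(attack_succeeds(c) for c in cases) / len(cases)

cases = [
    {"pred_label": "zebra", "target_label": "zebra",
     "pred_box": (0, 0, 10, 10), "gt_box": (0, 0, 10, 5),
     "passes_cooccurrence": True},   # IoU = 0.5, success
    {"pred_label": "zebra", "target_label": "zebra",
     "pred_box": (0, 0, 10, 10), "gt_box": (20, 20, 30, 30),
     "passes_cooccurrence": True},   # IoU = 0, failure
]
rate = fooling_rate(cases)
```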

4.1 Main results

Attack performance on R1 We report our main results on the conventional attack request R1 in Tab. 2, where the perturbation budget is set to 30. In general, we observe that under the white-box setting our F is comparable to existing methods on coco, which is reasonable as this metric considers only the co-occurrence matrix and both TOG+SAME and Cai [7] consider this semantic consistency. When considering the global layout metric R, we observe a clear performance improvement of about 30$\%$ over existing methods under all scenarios (R1-5, R1-50, R1-95), meaning that GLOW not only fools the victim object but also gives a more contextually consistent layout. Notably, this 30$\%$ average improvement also holds under the challenging zero-query black-box setting, which further demonstrates the transferability of our proposed attack plan generation. Please note that the results on Pascal are obtained with victim models trained on coco17train, which showcases the generality and superiority of GLOW.

There are also other interesting observations in Tab. 2. Firstly, performance improves for all methods from R1-5 to R1-50 to R1-95, indicating that the selection of the target label plays an important role. This trend validates our hypothesis that far-away labels, e.g. R1-5, are harder to attack than close-by ones, which in turn supports the necessity of a systematic design of target labels rather than random generation. Secondly, though TOG+SAME simply assigns the target label $c_{p}$ to all existing objects, it performs well under F. This observation further supports our design of more delicate consistency check metrics, e.g. R, as the co-occurrence matrix is vulnerable to such simple hacks.

Attack performance on R2 The advantages of GLOW are more noticeable in R2, where the victim object must be localized by the algorithm itself rather than being provided, as can be observed in Tab. 3. There are two main observations. Firstly, GLOW almost always beats the SOTAs on all evaluation metrics under R2-5, R2-50 and R2-95 in the white-box setting, e.g. about 35$\%$ relative improvement over the second best in terms of F+T and E+R under R2-5. This observation also holds when victim models are trained on coco17train and tested on Pascal. Interestingly, unlike R1, where the victim object is fixed across all methods, the results of T in R2 show that victim object selection matters under generic requests. Though neither TOG+SAME nor Cai [7] considers the overall layout consistency, the former gives a better score than the latter, as it naively enforces that all objects share the same target label and E in E+R measures only the existence of the target label. Please note that E and F are different. For instance, assuming layout consistency is already satisfied, if an attack on the victim object fails but turns another object into the target label, it is regarded as a success in E but a failure in F. GLOW, again, produces superior results by leveraging layout explicitly. Our second observation from Tab. 3 is that GLOW has better transfer rates, such as a 24$\%$ improvement over the context-aware baseline [7] and various types of random assignment under the black-box setting, which further showcases the benefits of utilizing global layout in attack plan generation and the potential limitation of exploiting only semantic context. We observe the same trend that the overall performance improves when the target label is closer to the present labels in word space, supporting our design of various target labels. We visualize some examples and results in Fig. 8. We observe that GLOW provides the most reasonable semantic configurations given the current scene layouts compared to all baselines.

Attack performance on R3 Results on the most challenging request, R3, are provided in Tab. 4. We remind our readers that C+R and F+T+C reflect different aspects of an algorithm, as the former does not care about specific objects but checks both the target labels and their counts. Assuming that R3 asks for two apples in the victim image and that our attacks are contextually consistent, F+T+C is successful if the two victim objects are perturbed to apple. In contrast, C+R reflects the number of apples in the perturbed image, and a mismatch in numbers leads to failure. Again, GLOW is a much safer choice for R3, as it gives the best performance in at least 12 of the 18 entries across the white-box and black-box settings.

5 Conclusion

In this paper, we propose GLOW, a novel attack generation algorithm for adversarial attacks on detectors. Compared to existing work, it explicitly takes both semantic context and geometric layout into consideration. By validating on two datasets, we demonstrate that GLOW produces superior performance under both the conventional attack request and more generic ones where victim objects are obtained by estimation. GLOW also shows better transfer rates under the challenging zero-query black-box setting.

Bibliography

[1] Ehud Barnea and Ohad Ben-Shahar. Exploring the bounds of the utility of context for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7412–7420, 2019.
[2] Amy Bearman, Olga Russakovsky, Vittorio Ferrari, and Li Fei-Fei. What’s the point: Semantic segmentation with point supervision. In European conference on computer vision, pages 549–565. Springer, 2016.
[3] Sean Bell, C Lawrence Zitnick, Kavita Bala, and Ross Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2874–2883, 2016.
[4] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S Davis. Soft-nms–improving object detection with one line of code. In Proceedings of the IEEE international conference on computer vision, pages 5561–5569, 2017.
[5] Zikui Cai, Shantanu Rane, Alejandro E Brito, Chengyu Song, Srikanth V Krishnamurthy, Amit K Roy-Chowdhury, and M Salman Asif. Zero-query transfer attacks on context-aware object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15024–15034, 2022.
[6] Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: high quality object detection and instance segmentation. IEEE transactions on pattern analysis and machine intelligence, 43(5):1483–1498, 2019.
[7] Zikui Cai, Xinxin Xie, Shasha Li, Mingjun Yin, Chengyu Song, Srikanth V Krishnamurthy, Amit K Roy-Chowdhury, and M Salman Asif. Context-aware transfer attacks for object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 149–157, 2022.
[8] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020.