Physical Adversarial Attacks on Deep Neural Networks for Traffic Sign Recognition: A Feasibility Study

Fabian Woitschek; Georg Schneider

arXiv:2302.13570·cs.CV·March 10, 2023

TL;DR

This study demonstrates the feasibility of physical adversarial attacks on traffic sign recognition DNNs using various black-box methods, highlighting safety concerns and the need for robust defenses in real-world applications.

Contribution

First to combine a general physical attack framework with multiple black-box methods and analyze their effectiveness under real-world conditions.

Findings

1. Physical attacks can reliably fool traffic sign DNNs under synthetic transformations that simulate real environments.

2. Different attack methods vary in success rates and perceptibility.

3. Results emphasize the importance of developing defenses like adversarial training.

Tables

TABLE I: Classification rates for all unlimited perturbations

Attack Type	Attack Parameters	Classification Rate Stop (%)	Classification Rate 60 (%)	SSIM to White-Box
White-Box	-	0.4	98.2	1.0
Soft SPSA	$s = 100$	50.7	46.8	0.529
Soft SPSA	$s = 500$	14.1	81.1	0.543
Soft SPSA	$s = 2000$	5.0	92.7	0.563
Hard SPSA	$h = 5$ & $s = 2000$	12.1	84.3	0.513
Hard SPSA	$h = 25$ & $s = 2000$	7.6	89.8	0.527
Hard SPSA	$h = 50$ & $s = 2000$	7.5	90.6	0.529
Model Stealing	STN ID	3.8	95.8	0.606
Model Stealing	SN ID	7.5	77.3	0.496
Model Stealing	VGG11BN ID	5.6	90.8	0.554
Model Stealing	VGG11BN OOD	4.2	94.8	0.563

TABLE II: Standard accuracy for MS

Architecture	Dataset	Accuracy GTSRB (%)	Accuracy Synthetic (%)
STN	ID	98.6	88.4
SN	ID	94.4	76.7
VGG11BN	ID	99.7	86.3
VGG11BN	OOD	95.2	79.9
STN Original	GTSRB	99.9	94.2

TABLE III: Classification rates for all limited perturbations

Attack Type	Classification Rate Stop (%)	Classification Rate 60 (%)
White-Box	18.1	76.4
Soft SPSA ($s = 2000$)	31.7	65.6
Hard SPSA ($h = 50$ & $s = 2000$)	37.5	61.2
Model Stealing (STN ID)	26.1	70.6
Model Stealing (VGG11BN OOD)	38.9	56.3
Manual	18.9	46.9
White-Box Unlimited	0.4	98.2

Equations

$$\underset{\delta}{\arg\min}\; \mathbb{E}_{t\sim T}\, \underbrace{\mathcal{L}(f(t(x+M\cdot\delta)),y^{*})}_{\text{adversarial}} + \underbrace{\lambda_{TV}\,TV(M\cdot\delta)}_{\text{inconspicuous}} + \underbrace{\lambda_{NPS}\,NPS(M\cdot(x+\delta))}_{\text{printable}} \qquad (1)$$

$$TV(a)=\sum_{r,c}\left|a_{r+1,c}-a_{r,c}\right|+\left|a_{r,c+1}-a_{r,c}\right| \qquad (2)$$

$$NPS(a)=\sum_{p\in P_{a}}\;\prod_{\hat{p}\in P_{p}}\left|p-\hat{p}\right| \qquad (3)$$

$$g=\frac{\partial\mathcal{L}(f(x^{\prime}),y^{*})}{\partial x^{\prime}}\approx\frac{1}{2s\alpha}\sum_{i=1}^{s}\frac{\mathcal{L}(f(x^{\prime}+\alpha\xi_{i}),y^{*})-\mathcal{L}(f(x^{\prime}-\alpha\xi_{i}),y^{*})}{\xi_{i}} \qquad (4)$$

$$\mathcal{L}_{substitute}(x,y,f(\cdot))=1-\frac{1}{h}\sum_{i=1}^{h}\left[f(x+\beta\zeta_{i})==y\right] \qquad (5)$$

$$x_{i+1}=x_{i}-\epsilon\cdot\mathrm{sgn}\left(\nabla_{x_{i}}\mathcal{L}(s_{i}(x_{i}),y^{*})\right) \qquad (6)$$

Full text

Fabian Woitschek and Georg Schneider

The authors are with ZF Friedrichshafen AG, Artificial Intelligence Lab, Saarbrücken, Germany. E-Mail: [email protected]

Abstract

Deep Neural Networks (DNNs) are increasingly applied in the real world in safety-critical applications like advanced driver assistance systems. One example of such a use case is traffic sign recognition. At the same time, it is known that current DNNs can be fooled by adversarial attacks, which raises safety concerns if those attacks can be applied under realistic conditions. In this work we apply different black-box attack methods to generate perturbations that are applied in the physical environment and can be used to fool systems under different environmental conditions. To the best of our knowledge, we are the first to combine a general framework for physical attacks with different black-box attack methods and to study the impact of the different methods on the success rate of the attack under the same setting. We show that reliable physical adversarial attacks can be performed with different methods and that it is also possible to reduce the perceptibility of the resulting perturbations. The findings highlight the need for viable defenses of a DNN even in the black-box case, but at the same time form the basis for securing a DNN with methods like adversarial training, which uses adversarial attacks to augment the original training data.

I INTRODUCTION

Deep Neural Networks (DNNs) are increasingly used in real-world applications, including Advanced Driver Assistance Systems (ADAS). One example of such a use case is the application of DNNs to Traffic Sign Recognition (TSR). However, it has been shown that machine learning systems in general, and DNNs in particular, are susceptible to small changes in the input data [1], which an adversary can exploit to fool the system by generating perturbations that are applied to the input data. Hence, it is relevant to explore whether similar attacks can be performed under realistic conditions, which would mean that systems deployed in reality have to be defended against such attacks.

Typical adversarial attacks directly manipulate the input of a DNN, which is hard for an adversary to do when attacking a real TSR system. The adversary usually does not have access to the TSR system that runs inside the vehicle, and even with such access it would be impossible to calculate perturbations in real time. Hence, we consider an alternative and more realistic attack scenario where the adversary applies the perturbation in the physical environment, which is then captured by the ADAS camera of the vehicle. In this case the perturbation must reliably fool the TSR system under different lighting conditions and viewpoints.

This attack scenario is the most dangerous one, since the adversary can impact the safety of many vehicles, simply by placing a perturbed traffic sign in the physical environment. However, calculating such perturbed traffic signs typically requires the adversary to have white-box access to the system, which is unrealistic. Hence, we look into existing work on black-box adversarial attacks and study whether physical attacks are also possible in this case where the adversary has only limited access to the system.

Fig. 1 gives an overview of the different parameters of an adversarial attack and highlights our focus areas. We only consider targeted attacks, since these have the greatest potential to cause actual damage in a realistic scenario. It is more severe if the adversary controls the exact output of a TSR system, which rules out minor misclassifications (e.g. the system predicts a Speed Limit 20 sign instead of a Speed Limit 30 sign).

Our contributions:

• We incorporate different black-box attack methods into a general framework that can be used to perform physical adversarial attacks

• We study how well each black-box attack method can be used to generate perturbations that reliably fool a TSR system under different conditions which simulate a realistic environment

• The capabilities of each black-box attack method to generate perturbations with limited strength and reduced perceptibility are examined

• Our findings show that physical attacks are possible even under strict black-box access, but also form the basis to protect DNNs in safety critical applications using defense methods like adversarial training [2]

II RELATED WORK

Adversarial attacks [1] whose perturbation is applied in a physical environment were first published for general-purpose object classifiers ([3, 4, 5]). Afterwards, similar attacks were demonstrated for the specific use case of ADAS ([6, 7, 8]). However, these attacks assume white-box knowledge of the system.

In a concurrent line of work, methods have been developed to perform adversarial attacks in a black-box scenario where the access of the adversary to the system is limited. Here, the simplest methods are transfer-based attacks [1], where the perturbation is calculated for a known white-box system on a similar task and then used to attack the black-box system. To improve the success rate of such transfer-based attacks the use of special surrogate systems [9] or ensembles [10] has been proposed to perform the white-box attack. Alternatively, attack methods have been developed that attack the black-box system directly. These include approximation-based methods [11] and decision-based methods [12].

We do not use decision-based methods, since they cannot be incorporated into the framework presented in III-A to calculate physical perturbations. For the transfer-based methods we only evaluate the approach that trains a local surrogate system which mimics the behavior of the black-box system. We choose this transfer-based method since a successful theft of the black-box system may be more revealing to the adversary (also for other tasks).

III PRACTICAL ADVERSARIAL ATTACKS

We first describe the method used for generating adversarial perturbations that can be used in a physical environment. This method is used to fool a classification system based on a DNN for the application of TSR, but it is methodically possible to use similar methods on other applications as well. At first, we assume complete white-box knowledge of the system that is attacked and later relax that constraint to the hard black-box case, where only the label of the class with the highest probability is returned by the system.

III-A Physical Attacks

To generate perturbations that fool a system when applied in a physical environment, we build on the RP2 algorithm [6]. This consists of an objective function to find a perturbation $\delta\sim[-1,1]^{D}$ for an input $x\sim[0,1]^{D}$ to a classifier system $f(\cdot)$ , such that the resulting perturbed input $x+\delta$ is classified as the adversary target class $y^{*}$ by the system:

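$$\underset{\delta}{\arg\min}\; \mathbb{E}_{t\sim T}\, \underbrace{\mathcal{L}(f(t(x+M\cdot\delta)),y^{*})}_{\text{adversarial}} + \underbrace{\lambda_{TV}\,TV(M\cdot\delta)}_{\text{inconspicuous}} + \underbrace{\lambda_{NPS}\,NPS(M\cdot(x+\delta))}_{\text{printable}} \qquad (1)$$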

Here, $M\sim\{0,1\}^{D}$ is a mask that limits the area of the input where the perturbation can be applied, $\lambda_{TV}$ and $\lambda_{NPS}$ are hyper-parameters that control the strength of the regularization of the perturbation and $t(\cdot)$ is a function that transforms the perturbed input based on the set of considered transformations $T$ . In the remainder of this work we refer to the case of $\lambda_{TV},\lambda_{NPS}=0$ (no regularization) as unlimited and to the case of $\lambda_{TV},\lambda_{NPS}\neq 0$ as limited. Further, $\mathcal{L}(\cdot)$ is a standard loss function that measures the difference between the prediction of the system and the target class $y^{*}$ . Since we study adversarial attacks on a classification system for TSR, we choose the Negative Log Likelihood (NLL) loss as $\mathcal{L}(\cdot)$ to express how well the perturbation $\delta$ fools the system.

Like [7] we use the Total Variation (TV) norm instead of the $\ell_{p}$ norm to regularize the strength of the perturbation during the optimization. If $r,c$ denote the row and column index the TV norm can be expressed as:

$$TV(a)=\sum_{r,c}\left|a_{r+1,c}-a_{r,c}\right|+\left|a_{r,c+1}-a_{r,c}\right| \qquad (2)$$
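As a minimal sketch of the TV term, assuming a single-channel perturbation stored as a list of rows: the function name `total_variation` and the boundary handling (differences that would leave the image are skipped) are illustrative choices that the text does not specify.

```python
def total_variation(a):
    """Sum of absolute differences between vertically and horizontally adjacent pixels."""
    rows, cols = len(a), len(a[0])
    tv = 0.0
    for r in range(rows):
        for c in range(cols):
            if r + 1 < rows:                      # vertical neighbour exists
                tv += abs(a[r + 1][c] - a[r][c])
            if c + 1 < cols:                      # horizontal neighbour exists
                tv += abs(a[r][c + 1] - a[r][c])
    return tv


flat = [[0.5, 0.5], [0.5, 0.5]]        # smooth patch
checker = [[0.0, 1.0], [1.0, 0.0]]     # noisy patch
print(total_variation(flat), total_variation(checker))
```

A smooth patch scores zero while a checkerboard scores high, which is why penalizing this term pushes the optimization towards fewer, larger uniform regions.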

As done in the original RP2 algorithm we also use the Non-Printability Score (NPS) [13] to account for reproduction errors when the digital perturbation is printed. If $P_{a}$ denotes the set of RGB triplets that exist in the perturbed input and $P_{p}$ denotes the set of RGB triplets that can be replicated by the printer the NPS term can be expressed as:

$$NPS(a)=\sum_{p\in P_{a}}\;\prod_{\hat{p}\in P_{p}}\left|p-\hat{p}\right| \qquad (3)$$
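The NPS term can be sketched as follows. This is an illustration under assumptions: $|p-\hat{p}|$ is taken to be the Euclidean distance between RGB triplets, and the two-color printable palette is made up for the example.

```python
def non_printability_score(pixels, printable):
    """Sum over pixels of the product of distances to every printable color."""
    score = 0.0
    for p in pixels:
        product = 1.0
        for q in printable:
            # Euclidean distance between two RGB triplets (assumed metric)
            product *= sum((pc - qc) ** 2 for pc, qc in zip(p, q)) ** 0.5
        score += product
    return score


printable = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]    # hypothetical palette: black and white
print(non_printability_score([(0.0, 0.0, 0.0)], printable))   # printable color
print(non_printability_score([(1.0, 0.0, 0.0)], printable))   # pure red, not in palette
```

Because the inner product vanishes as soon as a pixel matches any printable color, the score is zero exactly when every pixel can be reproduced by the printer.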

The combined objective function in (1) is then optimized using standard gradient descent to result in a perturbation $\delta$ , where we use the ADAM optimizer [14] to perform this optimization throughout this work.

For the final perturbation to fool a classifier system in the real world, it must be robust to changes in lighting conditions and viewpoints. To achieve this robustness, different environmental conditions are modeled during the optimization. In each iteration the function $t(\cdot)$ applies a different transformation from a predefined set $T$ to the perturbed input, simulating changing conditions. Hence, we use the Expectation over Transformation method [15] to robustly optimize the perturbation for a range of environmental conditions.
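The expectation over transformations can be illustrated with a Monte Carlo estimate. The sketch below is an assumption-laden toy: images are flat lists of intensities, the set $T$ is reduced to three brightness changes, and `loss_fn` is a hypothetical stand-in for the loss of a real classifier. It averages several draws per call, whereas the text describes drawing one transformation per optimization iteration.

```python
import random


def eot_loss(loss_fn, x, delta, mask, transforms, n_samples, rng):
    """Monte Carlo estimate of E_{t~T}[ loss(t(x + M * delta)) ]."""
    perturbed = [xi + mi * di for xi, mi, di in zip(x, mask, delta)]
    total = 0.0
    for _ in range(n_samples):
        t = rng.choice(transforms)                # draw one transformation from T
        total += loss_fn(t(perturbed))
    return total / n_samples


# Hypothetical stand-ins: brightness gains as T, squared error to a target mean as loss.
transforms = [lambda v, g=g: [min(1.0, max(0.0, g * p)) for p in v] for g in (0.6, 1.0, 1.4)]
loss_fn = lambda v: (sum(v) / len(v) - 0.2) ** 2

x = [0.5, 0.5, 0.5, 0.5]
mask = [1, 1, 0, 0]                               # perturbation allowed on the first two pixels only
delta = [-0.3, -0.3, -0.3, -0.3]
print(eot_loss(loss_fn, x, delta, mask, transforms, 200, random.Random(0)))
```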

III-B Black-Box Attacks

The described algorithm requires gradients to be computable for the system $f(\cdot)$ that should be attacked, meaning the adversary has white-box access to the system. In reality this constraint is typically hard to fulfill for an adversary, so it has to rely on black-box methods, where the internal behavior of the system does not have to be known.

III-B1 Gradient Approximation

The first black-box method we consider is based on the approximation of the true but unknown gradient $g$ . We use Simultaneous Perturbation Stochastic Approximation (SPSA) [16] to calculate an approximation of the true gradient, based on a two sided finite difference estimation with random directions. SPSA is used, since [17] has shown that SPSA is the most reliable compared to other gradient-free optimization techniques. Therefore, with noise images sampled from a Rademacher distribution $\xi_{1},\dots,\xi_{s}\sim\{-1,1\}^{D}$ the approximation of the true gradient $g$ can be expressed as:

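$$g=\frac{\partial\mathcal{L}(f(x^{\prime}),y^{*})}{\partial x^{\prime}}\approx\frac{1}{2s\alpha}\sum_{i=1}^{s}\frac{\mathcal{L}(f(x^{\prime}+\alpha\xi_{i}),y^{*})-\mathcal{L}(f(x^{\prime}-\alpha\xi_{i}),y^{*})}{\xi_{i}} \qquad (4)$$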

Here, $x^{\prime}=t(x+M\cdot\delta)$ is the perturbed and transformed input, $\alpha$ is the strength of the noise samples and $s$ is the number of noise samples used. We use the resulting gradient estimate as a direct plug-in for the true gradient and perform the same optimization as before.
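A minimal sketch of the two-sided SPSA estimate with Rademacher noise is given below. It assumes the input is a flat list of floats and that `loss_fn` only needs to be evaluated, never differentiated; the quadratic loss is a hypothetical stand-in for $\mathcal{L}(f(\cdot),y^{*})$.

```python
import random


def spsa_gradient(loss_fn, x, alpha, s, rng):
    """Estimate the gradient of loss_fn at x from 2*s loss evaluations."""
    d = len(x)
    grad = [0.0] * d
    for _ in range(s):
        xi = [rng.choice((-1.0, 1.0)) for _ in range(d)]      # Rademacher noise image
        plus = loss_fn([v + alpha * e for v, e in zip(x, xi)])
        minus = loss_fn([v - alpha * e for v, e in zip(x, xi)])
        diff = plus - minus
        for j in range(d):
            grad[j] += diff / xi[j]                           # element-wise division by xi
    return [g / (2.0 * s * alpha) for g in grad]


# Hypothetical loss with known gradient 2*x, used only to check the estimate.
quadratic = lambda v: sum(c * c for c in v)
print(spsa_gradient(quadratic, [1.0, -2.0, 0.5], alpha=0.01, s=2000, rng=random.Random(0)))
```

Each noise sample costs two queries to the system regardless of the input dimension, so the query budget grows with $s$ and not with $D$; the price is estimator noise that shrinks only as $s$ increases.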

However, (4) requires that the loss $\mathcal{L}(\cdot)$ can be computed, which in turn requires the output of $f(\cdot)$ to be a vector of probabilities over all classes, since we use the NLL loss. This constraint is often not met, since the system might only output the top-x classes without any confidence metric attached. In particular, TSR systems that are used in reality only give the user access to the top-1 class. Therefore, we now focus on the most difficult case and consider a system that only outputs the top-1 class without any further information. This is the least information a system can return, and we refer to this case as hard black-box (in contrast to soft black-box, where the complete probability vector is output).

To adjust the procedure to this case, $\mathcal{L}(\cdot)$ must only use the top-1 class. Hence, we define a substitute loss, inspired by [11], which measures the robustness of a classification under the influence of noise. To this end we sample noise images from a uniform distribution $\zeta_{1},\dots,\zeta_{h}\sim\mathcal{U}(-1,1)^{D}$ and express the substitute loss as:

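$$\mathcal{L}_{substitute}(x,y,f(\cdot))=1-\frac{1}{h}\sum_{i=1}^{h}\left[f(x+\beta\zeta_{i})==y\right] \qquad (5)$$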

Here, $y$ is the true class, $\beta$ is the strength of the noise samples and $h$ is the number of noise samples used. For this loss to give meaningful approximations, either the input $x$ must already be classified mainly as the true class, or a rather large noise strength has to be used (at least at the beginning) to find the true class by chance and iterate towards it over time.
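The substitute loss can be sketched as the fraction of noisy copies that are not assigned to class $y$. The threshold classifier below is a hypothetical stand-in for a system that returns only its top-1 label.

```python
import random


def substitute_loss(classify, x, y, beta, h, rng):
    """1 minus the fraction of h noisy copies of x that the system labels as y."""
    hits = 0
    for _ in range(h):
        noisy = [v + beta * rng.uniform(-1.0, 1.0) for v in x]   # zeta ~ U(-1, 1)^D
        if classify(noisy) == y:
            hits += 1
    return 1.0 - hits / h


# Hypothetical top-1 classifier: class 1 if the mean intensity exceeds 0.5, else class 0.
classify = lambda v: 1 if sum(v) / len(v) > 0.5 else 0

rng = random.Random(0)
print(substitute_loss(classify, [0.9, 0.9], 1, beta=0.1, h=50, rng=rng))    # far from the boundary
print(substitute_loss(classify, [0.52, 0.5], 1, beta=0.1, h=50, rng=rng))   # close to the boundary
```

Far from the decision boundary the loss saturates at 0 or 1 and carries no gradient signal, which matches the remark above that a larger noise strength may be needed at the start.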

III-B2 Model Stealing

Transfer-based black-box attacks [1] represent an alternative to gradient approximation-based attacks. Here the perturbation is generated using a white-box system and is then transferred to the black-box system. If the systems behave similarly the transfer rate is high and the black-box system is fooled by the perturbation. Hence, the adversary wants to have a white-box system that is as similar as possible to the black-box system. To achieve this, we use Model Stealing (MS) attacks to train a private surrogate system of the black-box system. Then this surrogate is attacked using the white-box attack from (1) and the resulting perturbation is used to attack the black-box system.

Regarding the access to the system we only consider the hard black-box case, since it is most challenging and realistic. The only information the adversary needs is the number of output classes of the black-box system, since this is required to train the surrogate system. This knowledge can be obtained by querying the black-box system on different inputs and observing the range of possible outputs. Further, the surrogate system can also have one additional class which functions as a placeholder, where all unexpected outputs from the black-box system are collected.

To train the surrogate we follow the basic approach introduced in [9]. First, for the surrogate a DNN architecture is chosen that roughly matches the expected expressivity of the black-box system. Then the black-box system is used to label an initial seed dataset and the surrogate is trained using these labels. Hence, the surrogate learns to mimic the labels of the black-box system on the presented data. Next, the dataset is augmented using an adversarial attack on the surrogate, which forces the new data samples to lie in a new decision region of the surrogate. Then the described label + train process is repeated, whereby the surrogate adapts the decision boundaries so that it outputs the same classes as the black-box system. This process is repeated for a certain amount of global iterations, through which the decision boundaries of the surrogate are moved closer and closer to the decision boundaries of the black-box system.
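The alternation of labeling, training and augmentation can be sketched with a deliberately small toy. Everything concrete here is an assumption for illustration: inputs are scalars, the black box is a threshold at 0.3, the surrogate is a single learned threshold, and the adversarial attack on the surrogate is replaced by a hand-written step that places a new sample just across the surrogate's boundary.

```python
def steal_model(black_box, train, augment, seed_data, global_iters):
    """Alternate black-box labeling, surrogate training and data augmentation."""
    data = list(seed_data)
    surrogate = None
    for _ in range(global_iters):
        labels = [black_box(x) for x in data]             # query top-1 labels only
        surrogate = train(data, labels)                   # fit the surrogate to those labels
        # one new sample per existing sample, so the dataset doubles every iteration
        data = data + [augment(surrogate, x, y) for x, y in zip(data, labels)]
    return surrogate, data


black_box = lambda x: int(x > 0.3)                        # hidden decision boundary


def train(data, labels):
    lo = max(x for x, y in zip(data, labels) if y == 0)
    hi = min(x for x, y in zip(data, labels) if y == 1)
    return (lo + hi) / 2.0                                # surrogate = learned threshold


def augment(threshold, x, y):
    # stand-in for the attack: move into the other decision region of the surrogate
    return threshold + 0.05 if y == 0 else threshold - 0.05


threshold, data = steal_model(black_box, train, augment, [0.0, 0.2, 0.6, 1.0], 3)
print(threshold, len(data))
```

In this toy the surrogate threshold starts at the midpoint of the seed data and moves towards the hidden boundary once the augmented samples near the surrogate boundary have been relabeled by the black box.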

To perform the adversarial attack on the surrogate, [9] use the untargeted Fast Gradient Sign Method (FGSM) [18]. In contrast, we follow the approach from [19], which uses the targeted Projected Gradient Descent Method (PGDM) [2]. This results in a higher quality of the surrogate because the data augmentation is more extensive and better represents the existing data space. The PGDM is an iterative expansion of the FGSM, which in case of a targeted attack adds the negative direction of the gradient to the current input $x_{i}$ :

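$$x_{i+1}=x_{i}-\epsilon\cdot\mathrm{sgn}\left(\nabla_{x_{i}}\mathcal{L}(s_{i}(x_{i}),y^{*})\right) \qquad (6)$$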

Here, $s(\cdot)$ is the surrogate system that is currently trained, $\epsilon$ is the step size of the attack and $\mathrm{sgn}$ denotes the signum function. For the PGDM the step size $\epsilon$ is divided into $K$ steps and the procedure is executed iteratively for $K$ iterations with an individual step size of $\epsilon/K$ . Hence, at each global iteration $i$ we perform a PGDM attack with a randomly drawn target class (that differs from the assigned label) on each data point in the current dataset. Therefore, the size of the dataset is doubled at each global iteration.
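The targeted sign-gradient iteration can be sketched on a small model with an analytic gradient. The assumptions are: the surrogate is a hypothetical linear softmax classifier with weight matrix `W`, the loss is the NLL of the target class, and inputs are clipped to $[0,1]$ after every step (the clipping is an added convention, not taken from the text).

```python
import math


def predict_proba(W, x):
    """Softmax probabilities of a linear classifier with one weight row per class."""
    logits = [sum(w * v for w, v in zip(row, x)) for row in W]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nll_grad(W, x, target):
    """Gradient of -log p(target | x) with respect to x."""
    p = predict_proba(W, x)
    err = [pk - (1.0 if k == target else 0.0) for k, pk in enumerate(p)]
    return [sum(err[k] * W[k][j] for k in range(len(W))) for j in range(len(x))]


def targeted_pgd(W, x, target, eps, K):
    """K sign-gradient steps of size eps / K towards the target class."""
    x = list(x)
    for _ in range(K):
        g = nll_grad(W, x, target)
        step = [(gj > 0) - (gj < 0) for gj in g]          # signum of each gradient entry
        x = [min(1.0, max(0.0, v - (eps / K) * s)) for v, s in zip(x, step)]
    return x


W = [[2.0, -1.0], [-1.0, 2.0]]                            # hypothetical two-class surrogate
x0 = [0.8, 0.2]
adv = targeted_pgd(W, x0, target=1, eps=0.4, K=4)
print(predict_proba(W, x0)[1], predict_proba(W, adv)[1])
```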

Using the described procedure, the quality of the surrogate depends among others on the dataset representing the initial seed data. In [9] the authors require the data to be roughly representative of the true domain of the input data of the black-box system. This requires that the adversary knows the input domain and that it can generate a small amount of In-Domain (ID) data samples. Whilst this is a realistic assumption for an adversary that attacks a TSR system, the authors in [20] perform a MS attack using only Out-Of-Domain (OOD) data by introducing a sampling procedure from large (unlabeled) general image datasets. We experiment with both ID and OOD data by using different datasets to represent the initial seed data, to also mimic the case where the adversary has close to no knowledge about the underlying input domain of the black-box system.

IV EXPERIMENTS

We perform numerous experiments to determine the feasibility of physical adversarial attacks on a TSR system based on the available system access. For each attack the starting image $x$ is a clean image of the associated traffic sign class, e.g. a raw Stop sign.

IV-A Setup

The DNN-based TSR system is represented by a publicly available implementation [22] of a Spatial Transformer (ST) [23], which is the state of the art on the German Traffic Sign Recognition Benchmark (GTSRB) [24]. However, in contrast to [22] we do not use data augmentation at test time and hence use only a single system instead of the ensemble. This is closer to a real TSR system, where no ensembles are used because of computational limitations. Still, the system achieves an accuracy of $>99.9\,\%$ on the original test set.

For evaluating the success of an adversarial perturbation, we test whether the final perturbation fools the system under different environmental conditions. To this end we test the impact of each perturbation on the TSR system under $1000$ different transformations and report the associated classification rates. Concretely, we use a combination of the following transformations (which matches the set $T$ used during the optimization): rotation, perspective distortion, color jitter, scaling and background noise injection. We test the accuracy of the TSR system under the described transformations by generating $1000$ images per class in GTSRB where an image, containing only a raw traffic sign of the associated class, is synthetically transformed with a random combination of the transformations. On this synthetic dataset the system has an accuracy of $94.2\,\%$ , meaning this data is more challenging, but the system also classifies this synthetic data quite well. Hence, if we later observe a drop in accuracy after a perturbation is applied, it originates from the perturbation and is not by chance.
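The reported classification rates can be read as the share of randomly transformed copies assigned to each class. A minimal sketch follows, in which the classifier and the brightness-only transformation are hypothetical stand-ins for the TSR system and the full transformation set.

```python
import random
from collections import Counter


def classification_rates(classify, image, sample_transform, n, rng):
    """Percentage of n randomly transformed copies of image assigned to each class."""
    counts = Counter(classify(sample_transform(rng)(image)) for _ in range(n))
    return {label: 100.0 * count / n for label, count in counts.items()}


# Hypothetical stand-ins for the TSR system and the transformation set.
def classify(img):
    return "Stop" if sum(img) / len(img) > 0.5 else "60"


def sample_transform(rng):
    gain = rng.uniform(0.7, 1.3)                          # random brightness change
    return lambda img: [min(1.0, gain * p) for p in img]


rates = classification_rates(classify, [0.5, 0.55, 0.45, 0.5], sample_transform, 1000, random.Random(0))
print(rates)
```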

To compare the similarity of perturbations resulting from different methods we use the Structural Similarity Index Measure (SSIM) [21], which calculates a similarity value between two images in the range $[0,1]$ . Here, a value of $0$ indicates that two images are very different and a value of $1$ indicates that the images are very similar or the same.
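For intuition, the SSIM formula can be evaluated once over whole images. This is a simplified sketch: common SSIM implementations average the same expression over local windows, and the constants $C_1=(0.01L)^2$ and $C_2=(0.03L)^2$ are the usual defaults for dynamic range $L$, not values reported in this work.

```python
def global_ssim(x, y, dynamic_range=1.0):
    """SSIM computed once over two whole images given as flat lists of intensities."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    c1 = (0.01 * dynamic_range) ** 2                      # stabilizes the luminance term
    c2 = (0.03 * dynamic_range) ** 2                      # stabilizes the contrast/structure term
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))


striped = [0.1, 0.9, 0.1, 0.9]
uniform = [0.5, 0.5, 0.5, 0.5]
print(global_ssim(striped, striped), global_ssim(striped, uniform))
```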

For the adversarial attack we initially focus on the unlimited case, where in (1) no mask $M$ is used and $\lambda_{TV},\lambda_{NPS}=0$ , since we first want to determine the differences in the quality of a perturbation generated with different system accesses. Later, in IV-E we also perform experiments in the limited case to generate perturbations that have a reduced perceptibility for the human visual system, but still reliably fool the TSR system.

IV-B White-Box Attacks

First, we evaluate the basic attack from (1) using the white-box access to the system to compute the required true gradients. In Fig. 2 two exemplary perturbations are shown for two different attack scenarios and the associated classification rates are given. For both scenarios the attack is highly successful and fools the TSR system in nearly all of the $1000$ transformations evaluated for each perturbed traffic sign. Hence, white-box attacks can be used to generate perturbations that can be applied in reality and fool a system under a variety of environmental conditions.

In the rest of this work we only present results for the attack scenario shown in 2(a), meaning the goal of the adversary is to generate a perturbation that fools the TSR system into predicting a Stop sign as a Speed Limit 60 sign. We also explored other attack scenarios (e.g. 2(b)) which behave similarly. However, the selected scenario is one of the hardest: a human can easily distinguish the two traffic signs, since an octagonal shape is used exclusively for a Stop sign. Hence, a human would never be fooled by such a perturbed traffic sign, but the TSR system is very reliably fooled by the generated perturbation as demonstrated in 2(a).

IV-C Soft Black-Box Attacks

Next, we constrain the access of the adversary to the soft black-box case, where it still observes the complete probability vector of the system as output (typically from a softmax layer) but is unable to compute any gradients through the system. Also, the preprocessing (normalization, etc.) used by the system is unknown. We perform the same attack as for white-box access, only now approximating the true gradient with SPSA from (4).

In Fig. 3 exemplary images of resulting perturbations are shown and in Table I the associated classification rates are compared for all methods that are used for generating an unlimited perturbation. Additionally, the similarity of each perturbation is compared to the white-box perturbation using the SSIM. For the soft black-box attack based on SPSA we evaluate how many noise samples are required for an accurate approximation of the true gradient. One can observe in Fig. 3 that an increase in $s$ leads to a convergence to the result of the white-box attack (2(a)) and similar main regions are perturbed. Consequently, the SSIM in comparison to the white-box perturbation increases (Table I). This behavior transfers to the associated classification rates, where it is again possible to achieve high success rates under the variety of transformations if enough noise samples are used during SPSA. Even if only a very low number is used, the resulting perturbation fools the TSR system in $49.3\,\%$ of the cases, which is already enough to prevent a successful use of the TSR system in reality.

IV-D Hard Black-Box Attacks

As the last step towards the most limited access of an adversary we now evaluate how well the described methods perform in the hard black-box case. Hence, the TSR system now only outputs the top-1 class and no further information.

IV-D1 Gradient Approximation

We first test the attack based on SPSA, where we perform the same attack as previously, but now use the substitute loss from (5). For all attacks we use $s=2000$ and evaluate how many noise samples $h$ are required for a useful behavior of the substitute loss. The visual results are shown in Fig. 4 and the associated classification rates and SSIMs are again given in Table I. The behavior is similar to the soft black-box case, where an increase in the noise samples leads to a perturbation that converges to the white-box perturbation. Already a small number of noise samples is sufficient to approximate the NLL loss and result in a perturbation that fools the system highly successfully. In summary, it is possible even in the hard black-box case to generate perturbations that fool the system reliably under various environmental conditions.

IV-D2 Model Stealing

As an alternative to a gradient approximation-based attack, we now evaluate a white-box attack with a preceding MS attack to generate the white-box surrogate system of the unknown hard black-box. We experiment with different architectures of the surrogate which includes the original architecture (STN), a VGG11 architecture [25] with batch normalization (VGG11BN) and a SqueezeNet architecture [26] (SN). For the last two architectures we use a version that is pre-trained on ImageNet [27] provided by [28]. Initially, we use ID seed data that is represented by the GTSRB [24] dataset, where we sample ten random images from each class to build the dataset, but do not use the assigned class labels as these are generated with the black-box system. We also test the hardest possible case of a MS attack, where the adversary has no information about the architecture of the system and the input domain. For this we use the VGG11BN architecture and use random images from the tiny-imagenet dataset [29] as seed data.

In Table II the accuracies of the trained surrogate systems are shown for the GTSRB dataset and our synthetically created dataset from IV-A. All surrogates have high accuracies, but if the architecture of the surrogate differs too much or if OOD data is used for seeding, a drop in accuracy can be noted. Also, the accuracy on our synthetic dataset drops more overall than the accuracy on the GTSRB test dataset. Nevertheless, the surrogates mainly learn to classify traffic signs accurately, even when trained on OOD data, meaning the surrogate never sees a traffic sign during training.

Next, we evaluate whether the surrogate systems are similar enough to the black-box system, such that perturbations transfer between those systems. We use the white-box attack from (1) to generate perturbations on the surrogates and then use these perturbations to attack the black-box system. In Fig. 5 the resulting perturbations are visualized and the classification rates and SSIMs are again given in Table I. One can observe that perturbations which are more similar to the white-box perturbation (2(a)) also have a higher success rate of the targeted attack. Nevertheless, all perturbations still achieve good transfer rates, meaning all surrogates have decision boundaries roughly similar to the original system. However, using the same architecture as the black-box system leads to the best transfer rates, but an adversary typically does not have this knowledge. Interestingly, we achieve a higher transfer rate if OOD data is used than if ID data is used, although the general accuracy of that system is lower (Table II). An explanation is that the VGG11BN OOD system has a high accuracy (close to the level of STN Original) on images of Speed Limit 60 signs in both datasets. Hence, the system is worse in general accuracy, but learned the decision boundaries of the Speed Limit 60 class rather accurately which leads to an increased transfer rate.

IV-E Low Perceptibility Attacks

Our results so far show that physical adversarial attacks are possible even in the hard black-box case where the adversary has close to no knowledge of the system it attacks. However, the previous perturbations have a rather high strength and are obvious to a human observer. In the next step we generate perturbations that have a decreased perceptibility by introducing $\lambda_{TV},\lambda_{NPS}\neq 0$ . Hence, we perform the same attacks evaluated previously, but adjust the optimization to include regularization terms. For each attack type, we only evaluate the best-performing configuration.

The resulting perturbations are shown in Fig. 6 and the associated classification rates are given in Table III. In addition to the optimized perturbations we also include a simple manual perturbation that is derived by inspecting the most prominent regions in the other minimized perturbations and placing black rectangles by hand.

Comparing the perturbation from the unlimited white-box attack (Fig. 2(a)) with the perturbation from the limited white-box attack (Fig. 6(a)), one observes that the perturbation is now confined to the most susceptible regions. This noticeably reduces the visibility of the perturbation but also lowers the success rate of the targeted attack. Hence, there is a trade-off between the perceptibility and the adversarial character of the perturbation, which the adversary can tune depending on the concrete use case. Similar behavior is observed for all black-box attack methods: the resulting perturbations closely resemble the white-box perturbation, and the classification rates behave alike. All methods consider similar regions to be most important and focus the perturbation on those areas. Hence, the approximation and stealing of the black-box system are successful and can be used to generate physical perturbations that fool hard black-box TSR systems under a variety of environmental conditions.

For the approximation attack (Fig. 6(b) and 6(c)), the perceptibility can be reduced further by applying a mask $M$ to the perturbations. In this way, unwanted pixel artifacts resulting from the approximation are excluded and only the most important regions of the perturbations remain. This has very little impact on the success of the attack and allows an adversary to approximate the white-box results more closely. Alternatively, the adversary can manually (similar to [6]) generate a simple perturbation, which fools the system in roughly half of the cases but is the easiest to deploy in reality.
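Applying the mask $M$ can be read as an element-wise product between a binary mask and the perturbation. The sketch below illustrates this under assumptions: this section does not state how $M$ is constructed, so the magnitude threshold in `threshold_mask` is a hypothetical choice, as are the function names and the nested-list layout (rows of pixels, each pixel a tuple of channel values).

```python
def threshold_mask(delta, tau):
    """Hypothetical mask: 1 where the largest channel magnitude of a pixel is at least tau."""
    return [
        [1 if max(abs(value) for value in pixel) >= tau else 0 for pixel in row]
        for row in delta
    ]


def apply_mask(delta, mask):
    """Element-wise product of a per-pixel binary mask with the perturbation."""
    return [
        [tuple(m * value for value in pixel) for pixel, m in zip(row, mask_row)]
        for row, mask_row in zip(delta, mask)
    ]
```

A mask drawn by hand around the prominent regions could be passed to `apply_mask` in exactly the same way; the threshold variant only serves to show how small isolated artifacts drop out while strong regions are kept unchanged.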

V CONCLUSION

We present results on the feasibility of physical adversarial attacks against a TSR system when the adversary has only limited access. Even in the most difficult hard black-box case, attacks are practicable and result in perturbations that reliably fool the system under different synthetic transformations, which simulate changing environmental conditions. It is further possible to reduce the perceptibility of the perturbations, where a trade-off exists between the perceptibility and the adversarial character that the adversary must tune depending on the concrete use case. The presented results highlight the need to secure systems that are deployed in reality against adversarial attacks and, at the same time, provide the means to apply defense methods such as adversarial training.

Bibliography

[1] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” in International Conference on Learning Representations, Banff, Canada, April 2014.
[2] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” in International Conference on Learning Representations, Vancouver, Canada, April 2018.
[3] A. Kurakin, I. Goodfellow, and S. Bengio, “Adversarial examples in the physical world,” in International Conference on Learning Representations: Workshop, Toulon, France, April 2017.
[4] T. B. Brown, D. Mané, A. Roy, M. Abadi, and J. Gilmer, “Adversarial patch,” in Conference on Neural Information Processing Systems, Long Beach, USA, December 2017.
[5] S. T. K. Jan, J. Messou, Y.-C. Lin, J.-B. Huang, and G. Wang, “Connecting the digital and physical world: Improving the robustness of adversarial attacks,” in AAAI Conference on Artificial Intelligence, Honolulu, USA, April 2019.
[6] K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, and D. Song, “Robust physical-world attacks on deep learning visual classification,” in IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, June 2018.
[7] K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, F. Tramèr, A. Prakash, T. Kohno, and D. Song, “Physical adversarial examples for object detectors,” in USENIX Workshop on Offensive Technologies, Baltimore, USA, August 2018.
[8] C. Sitawarin, A. N. Bhagoji, A. Mosenia, P. Mittal, and M. Chiang, “Rogue signs: Deceiving traffic sign recognition with malicious ads and logos,” in IEEE Symposium on Security and Privacy: Deep Learning and Security Workshop, San Francisco, USA, May 2018.