SATBA: An Invisible Backdoor Attack Based On Spatial Attention

Huasong Zhou; Xiaowei Xu; Xiaodong Wang; and Leon Bevan Bullock

arXiv:2302.13056·cs.CR·March 6, 2024

TL;DR

SATBA introduces a stealthy backdoor attack leveraging spatial attention and U-net architecture to embed triggers into images, achieving high success rates while evading detection and preserving model accuracy.

Contribution

This paper presents SATBA, a novel backdoor attack method that uses spatial attention and U-net to embed triggers invisibly, overcoming visibility and feature loss issues of prior attacks.

Findings

01 High attack success rate across multiple datasets

02 Robustness against backdoor defenses

03 Enhanced stealthiness demonstrated through image similarity experiments

Abstract

Backdoor attacks have emerged as a novel and concerning threat to AI security. These attacks involve training a Deep Neural Network (DNN) on datasets that contain hidden trigger patterns. Although the poisoned model behaves normally on benign samples, it exhibits abnormal behavior on samples containing the trigger pattern. However, most existing backdoor attacks suffer from two significant drawbacks: their trigger patterns are visible and easy to detect by backdoor defenses or even human inspection, and their injection process results in the loss of natural sample features and trigger patterns, thereby reducing the attack success rate and model accuracy. In this paper, we propose a novel backdoor attack named SATBA that overcomes these limitations using spatial attention and a U-Net-based model. The attack process begins by using spatial attention to extract meaningful data features…

Tables

Table 1: Overview of the datasets used in the experiment.

Dataset	Input Size	Trainset	Testset	Classes
MNIST[7]	32 × 32 × 3	60000	10000	10
CIFAR10[17]	32 × 32 × 3	50000	10000	10
GTSRB[33]	32 × 32 × 3	39209	12630	43

Table 2: Comparison of Attack Success Rate (ASR) and Clean Data Accuracy (CDA) of SATBA with other attacks on Alexnet, VGG16, and Resnet18. The "Clean" rows show the accuracy of the original model on the clean dataset. The best results are in bold, and * indicates the same score as the best result. The second-best result is underlined.

Attack	Model	MNIST CDA	MNIST ASR	CIFAR10 CDA	CIFAR10 ASR	GTSRB CDA	GTSRB ASR
BadNets[11]	AlexNet	0.993	0.999	0.869	0.943	0.957	0.994
BadNets[11]	VGG16	0.994	1.000	0.886	0.967	0.963	0.995
BadNets[11]	Resnet18	0.996	1.000	0.898	0.961	0.962	0.997
Blend[6]	AlexNet	0.992	1.000	0.885	0.996	0.960	0.999
Blend[6]	VGG16	0.992	1.000	0.895	0.997	0.962	0.999
Blend[6]	Resnet18	0.994	1.000	0.906	0.998	0.961	0.999
Clean	AlexNet	0.992	-	0.889	-	0.961	-
Clean	VGG16	0.901	-	0.902	-	0.962	-
Clean	Resnet18	0.994	-	0.911	-	0.967	-
Refool[21]	AlexNet	0.992	1.000	0.879	0.938	0.959	0.992
Refool[21]	VGG16	0.991	1.000	0.892	0.953	0.963	0.992
Refool[21]	Resnet18	0.994	1.000	0.901	0.954	0.961	0.995
Wanet[24]	AlexNet	0.994	0.999	0.886	0.998	0.961	0.999
Wanet[24]	VGG16	0.995	1.000	0.893	0.999	0.965	0.999
Wanet[24]	Resnet18	0.995	1.000	0.903	0.999	0.960	0.999
SATBA (Ours)	AlexNet	0.996	1.000*	0.892	0.998	0.958	0.996
SATBA (Ours)	VGG16	0.995	1.000*	0.888	0.998	0.965	0.999*
SATBA (Ours)	Resnet18	0.996*	1.000*	0.904	0.999	0.955	0.994

Table 3: Evaluation of attack stealthiness using PSNR, SSIM, MSE, and LPIPS compared to different attack methods. The best performance is highlighted in boldface and the second-best is underlined.

Attack	MNIST PSNR	MNIST SSIM	MNIST MSE	MNIST LPIPS	CIFAR10 PSNR	CIFAR10 SSIM	CIFAR10 MSE	CIFAR10 LPIPS	GTSRB PSNR	GTSRB SSIM	GTSRB MSE	GTSRB LPIPS
Badnets	24.0935	0.9874	253.3595	0.0013	30.8597	0.9935	89.0661	0.0015	27.6426	0.9914	151.2264	0.0076
Blend	15.9751	0.5637	1644.4315	0.0215	20.2685	0.7835	652.2303	0.0313	18.6797	0.6788	961.7501	0.0796
Refool	12.4355	0.4612	6023.4371	0.0734	17.2960	0.6920	1812.7980	0.0668	14.9439	0.5442	3440.4423	0.1578
Wanet	23.5286	0.9314	298.6892	0.0090	29.2407	0.9511	90.2150	0.0077	32.3926	0.960	78.3503	0.0863
SATBA	47.2693	0.9862	1.4735	0.0001	36.8021	0.9857	22.8828	0.0060	36.6371	0.9802	21.5731	0.0065

Equations

$$D_{poisoned} = D_{backdoor} \cup D_{clean}$$

$$\hat{F}\left(x_{i}\right) = y_{i}, \quad \hat{F}\left(\hat{x_{i}}\right) = y_{target}$$

$$t_{i} = \mathcal{G}\left(x_{i}\right) = f_{i} \odot M_{i} = L\left(x_{i}\right) \odot R\left(\frac{1}{1 + e^{F\left(\sum_{j=0}^{N} O_{i}^{j}\right)}}\right)$$

$$\widehat{x_{i}} = I\left(x_{i} \oplus t_{i}\right)$$

$$\widehat{t_{i}} = E\left(\widehat{x_{i}}\right)$$

$$\mathcal{L} = \lambda_{1}\mathcal{L}_{in} + \lambda_{2}\mathcal{L}_{ex}$$

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Anomaly Detection Techniques and Applications · Digital Media Forensic Detection

Full text

SATBA: An Invisible Backdoor Attack Based on Spatial Attention

Huasong Zhou, Xiaowei Xu (✉), Xiaodong Wang, and Leon Bevan Bullock

College of Computer Science and Technology, Ocean University of China, Qingdao, China

Email: [email protected]

This research was partially supported by the National Key Research and Development Program of China under Grant 2020YFB1710005 and the Natural Science Foundation of Shandong Province under Grant ZR2022MF299.

Abstract

Backdoor attacks pose a new and emerging threat to AI security, in which Deep Neural Networks (DNNs) are trained on datasets that contain hidden trigger patterns. Although the poisoned model behaves normally on benign samples, it produces anomalous results on samples containing the trigger pattern. Nevertheless, most existing backdoor attacks face two significant drawbacks: their trigger patterns are visible and easy to detect by human inspection, and their injection process leads to the loss of natural sample features and trigger patterns, thereby reducing the attack success rate and the model accuracy. In this paper, we propose a novel backdoor attack named SATBA that overcomes these limitations by using a spatial attention mechanism and a U-type model. Our attack leverages the spatial attention mechanism to extract data features and generate invisible trigger patterns that are correlated with clean data. It then uses the U-type model to plant these trigger patterns into the original data without causing noticeable feature loss. We evaluate our attack on three prominent image classification DNNs across three standard datasets and demonstrate that it achieves a high attack success rate and robustness against backdoor defenses. We also conduct extensive experiments on image similarity to highlight the stealthiness of our attack.

Keywords: Backdoor Attack · Deep Neural Network · Spatial Attention · U-Net

1 Introduction

Over the past decade, deep neural networks (DNNs) have shown impressive performance in domains such as facial recognition[12], automated driving[37], and medical diagnosis[25]. However, they are also seriously vulnerable to adversarial attacks[10], which can tamper with the prediction output of DNNs by adding small noise to the input samples.

Backdoor attacks are a stealthy attack method against DNNs that has emerged with the advancement of attack techniques. In contrast to traditional adversarial attacks, backdoor attacks seek to insert a hidden backdoor in the training process of the DNNs, which enables the target DNNs to behave normally on clean samples but alter their output when the hidden backdoor is activated by the attacker’s designed input. This allows attackers to manipulate the DNNs.

Since the introduction of backdoor attacks, various approaches have been developed to implement them. Some methods directly modify clean data to poison it, such as Badnets[11], Blend[6] and SIG[2]. More recently, RDBA[42] proposed a raindrop-based attack that poisons clean images and demonstrates the potential threat of backdoor attacks induced by natural conditions in the physical world.

Despite these improvements, most existing backdoor attack methods still suffer from the following major challenges: (1) The trigger incorporated in the clean image is static and fixed. That is, all poisoned images share the same trigger pattern. (2) The trigger is easy to identify and erase by defense methods and even humans because it is too conspicuous. (3) The features of images and trigger patterns are usually lost during the attack stage as a result of the modification of the clean image and trigger in the spatial domain.

This paper proposes SATBA, a new imperceptible backdoor attack on deep neural networks (DNNs) that utilizes spatial attention[23]. Our attack involves three steps: (1) extracting image features with a traditional algorithm such as HOG[27] or LBP[19], (2) obtaining the spatial attention matrix of the victim model on clean images and creating trigger patterns based on it, and (3) employing a U-shaped convolutional neural network to embed the trigger into clean images and launching attacks on the training process of the targeted model. We test our attack on benchmark datasets and DNNs and show that it is effective and reliable: it attains a high attack success rate (ASR), preserves a high clean data accuracy (CDA), and exhibits a low anomaly index and high stealthiness. We summarize the main contributions of this paper as follows:

- This paper presents the SATBA attack, the first attempt to use spatial attention mechanisms to create trigger patterns and install backdoors into DNN models.

- A U-Net based network is designed for injecting trigger patterns into clean images with minimal feature loss. The network preserves both the clean image and the trigger pattern features during the injection process.

- Extensive experiments show that SATBA outperforms several conventional backdoor attack methods in terms of attack success rate, robustness, and stealthiness, demonstrating the power and versatility of our approach to neural network security.

The remainder of this paper is structured as follows. In Section 2, we briefly review the related work on U-Net, attention mechanisms and backdoor attacks for image classification. In Section 3, we present the details of our proposed backdoor attack method. Our experimental results are reported and analyzed in Section 4. Finally, we conclude this paper in Section 5.

2 Related Work

2.1 U-Net

U-Net[28] is a U-shaped Fully Convolutional Network (FCN) that was originally proposed by Ronneberger et al. for medical image segmentation[3]. It achieved state-of-the-art results in the ISBI Cell Tracking challenge 2015[32]. Since then, U-Net has become a popular and versatile network structure for various tasks. For instance, Zhou et al.[43] introduced a nested structure of convolutional layers connected through skip connections to model multi-scale representations. Oktay et al.[26] combined U-Net with an attention mechanism to increase the sensitivity of the model to foreground pixels. This model was named Attention U-Net. Xiao et al.[40] explored the implementation of residual blocks[13] in U-Net and found that they improved the convergence speed of the network training compared to the original U-Net. They called their model Deep U-Net. Most recently, Chen et al.[5] proposed TransUNet, a novel deep network that integrated Vision Transformer (ViT)[9] into U-Net. This model aimed to address the limitation of U-Net in modeling long-range dependencies. TransUNet was one of the first works that applied ViT to the U-Net architecture.

The skip connection[3] is a key component of U-Net's structure. It enables the network to leverage information from low-level convolutional layers, which can, to a certain extent, compensate for the feature loss induced by convolution operations. Inspired by this benefit, we incorporate U-Net into our trigger injection network to tackle the loss of clean-image and trigger-pattern representations that occurs while poisoned images are generated.

2.2 Attention Mechanism

Attention mechanisms were originally proposed for Computer Vision (CV) tasks. They gained popularity after the work of Mnih et al.[23], who combined visual attention with an RNN model for image classification tasks. Subsequently, Bahdanau et al.[1] applied attention mechanisms to Natural Language Processing (NLP)[4, 8] for the first time. Vaswani et al.[30] introduced self-attention for text representation learning. Since then, attention mechanisms have been widely used.

Attention mechanisms for computer vision can be broadly categorized into four types: channel attention, spatial attention, temporal attention, and branch attention. Spatial attention refers to the ability of a model to selectively focus on specific regions of an image. One of the earliest works on spatial attention was the Recurrent Attention Model (RAM) by Mnih et al.[23], which used recurrent neural networks (RNNs)[22] and reinforcement learning (RL) to learn where to attend. Another influential work was the Spatial Transformer Network (STN) by Jaderberg et al.[15], which incorporated a trainable module that could explicitly warp the important regions of the input image. More recently, Dosovitskiy et al.[9] proposed the Vision Transformer (ViT), which applied the transformer architecture[35], originally designed for natural language processing, to image classification tasks.

Spatial attention can be exploited to locate the important regions of interest in an image for a given victim model. By producing and embedding backdoor triggers into these regions, we can theoretically enhance the performance and robustness of the attack. This is because different images may have different attention regions for different target models, and therefore the trigger pattern can be more flexible, i.e., the trigger can vary depending on the sample and the target model.

2.3 Backdoor Attack

Backdoor attacks in deep neural networks (DNNs) were first introduced by Gu et al.[11], who inserted a small patch into clean images and used the poisoned images to train the target DNNs. This attack, also known as Badnets or Patched attack, could manipulate the behavior of the trained model when exposed to specific images with the trigger pattern. Chen et al.[6] suggested a different strategy for trigger generation and injection, using a Hello Kitty image as a trigger and optimizing the weights between benign and trigger images. This method, also called Blend, overlaid the trigger image on the clean image to produce poisoned images. Liu et al.[20] created poisoned data by performing reverse projection to fragile neurons in DNNs. Their trigger was universal but static. Turner et al.[34] examined the clean-label backdoor attack, which could achieve a high attack success rate without altering the target label of backdoor samples. Liu et al.[21] proposed Refool, which exploited physical reflection in daily life to construct a reflection model and use the reflected image of an object as a trigger, increasing the concealment of the trigger. Tuan et al.[24] designed a backdoor attack based on image distortion and introduced a novel training method called noise mode, which further narrowed the visual gap between poisoned and clean samples. Zhao et al.[42] applied a backdoor attack via raindrops, indicating the potential threat of backdoor attacks caused by natural conditions in the physical world.

The main goal of most existing backdoor attacks is to craft the poisoned image to resemble the original image. However, the triggers used in these attacks are often conspicuous and perceptible to human vision. Furthermore, they are independent of the sample itself and can be detected and removed by most defense methods. Unlike these attacks, our proposed backdoor attack derives the trigger from the clean sample itself and injects it with a U-Net model, which makes it stealthier and more effective than many other attacks.

3 Method

In this section, we shall first explicate the concept of a backdoor attack. Next, we provide an overview of the SATBA method, and subsequently, we present our novel approach for executing backdoor attacks that relies on spatial attention.

3.1 Definition of Backdoor Attack

The primary objective of this research is to investigate the efficacy of backdoor attacks on image classification neural networks. We define $\mathcal{X}$ as the input image domain and $\mathcal{C}$ as the corresponding ground-truth label set of $\mathcal{X}$. A deep neural network $\mathcal{F}$ is trained to map the input space to the label space using the dataset $D_{train}=\{\left(x_{i},y_{i}\right)\}$, where $x_{i}\in\mathcal{X},y_{i}\in\mathcal{C}$ and $i=1,\ 2,\ \ldots,\ N$. In a backdoor attack, the attacker carefully selects clean images from $D_{train}$ and generates poisoned samples. A DNN model $\hat{F}$ is then trained on a poisoned dataset $D_{poisoned}$, which consists of backdoored images $D_{backdoor}$, a subset of $D_{train}$, and the remaining clean images $D_{clean}$, i.e.,

$$D_{poisoned} = D_{backdoor} \cup D_{clean}$$

Accordingly, the poison rate is defined as $\eta=\frac{\left|D_{backdoor}\right|}{\left|D_{train}\right|}$. As a consequence, the poisoned model $\hat{F}$ works as expected on the pristine dataset $D_{clean}$ but outputs malicious predictions when triggered by a poisoned image, that is,

$$\hat{F}\left(x_{i}\right) = y_{i}, \quad \hat{F}\left(\hat{x_{i}}\right) = y_{target}$$

where $x_{i}\in D_{clean},y_{i}\in\mathcal{C},\hat{x_{i}}\in D_{backdoor},y_{i}\not=y_{target}$. The ground-truth label of $x_{i}$ and the attack's target label are $y_{i}$ and $y_{target}$, respectively.
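The definition above can be illustrated with a minimal Python sketch of how a poisoned training set might be assembled for a given poison rate $\eta$. The function name `build_poisoned_dataset` and the `poison_fn` callback are hypothetical stand-ins for the trigger-injection step, and the relabeling of every backdoored sample to a single target label is an assumption of this sketch.

```python
import random

def build_poisoned_dataset(train_set, poison_fn, target_label, poison_rate, seed=0):
    """Return (D_poisoned, backdoor_indices) for a list of (image, label) pairs.

    poison_fn is a hypothetical placeholder for the trigger-injection step.
    """
    rng = random.Random(seed)
    # |D_backdoor| = eta * |D_train|
    n_backdoor = int(round(poison_rate * len(train_set)))
    # Only samples with y_i != y_target are candidates for poisoning.
    candidates = [i for i, (_, y) in enumerate(train_set) if y != target_label]
    chosen = set(rng.sample(candidates, n_backdoor))
    poisoned = []
    for i, (x, y) in enumerate(train_set):
        if i in chosen:
            # Backdoored sample: poisoned image, relabeled to the target class.
            poisoned.append((poison_fn(x), target_label))
        else:
            # Clean sample: kept unchanged.
            poisoned.append((x, y))
    return poisoned, chosen
```

The returned list plays the role of $D_{poisoned}$: it has the same size as $D_{train}$, with the chosen clean samples replaced by their poisoned counterparts.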

3.2 Attack Pipeline

Fig. 1 illustrates the pipeline of our attack. First, the Trigger Generation module takes a clean image as input and generates a trigger related to it. The Injection Model then takes the clean image and the trigger as inputs and produces the poisoned image by adding the trigger to a specific location of the clean image. After that, we train a victim deep model on the poisoned dataset that contains the poisoned images from the previous process. When the training process ends, the backdoor has been injected into the target model.

3.3 Our Proposed Attack SATBA

Our work is based on the assumption that the attacker possesses detailed information about the targeted deep neural network (DNN) and aims to develop a function for generating triggers using spatial attention, denoted as $\mathcal{G}$. Our approach begins with feature extraction from a clean sample using a pretrained benign model with an identical architecture to that of the target model. Next, the clean image is passed through the pretrained model to obtain the feature maps, which are used to calculate spatial attention maps for each feature map. These maps are then consolidated into a single spatial attention matrix $\mathcal{M}$. Finally, we compute the dot product between the image feature $f_{i}$ and $M_{i}$, yielding the trigger $t_{i}$. More precisely, we employ the spatial attention mapping method proposed in PFSAN[38] for our experiments. To generate the trigger $t_{i}$ for a given clean image $x_{i}$, we utilize the following equation:

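$$t_{i} = \mathcal{G}\left(x_{i}\right) = f_{i} \odot M_{i} = L\left(x_{i}\right) \odot R\left(\frac{1}{1 + e^{F\left(\sum_{j=0}^{N} O_{i}^{j}\right)}}\right)$$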

Here, $L\left(\cdot\right)$ is the feature extraction function, $O_{i}^{j}$ represents the feature map of each convolution layer in the pretrained model, $R\left(\cdot\right)$ denotes the reshape operation and $F\left(\cdot\right)$ indicates the flatten operation. A schematic illustration of this procedure can be found in Fig. 2.
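As a rough illustration of the trigger generation step, the NumPy sketch below sums per-layer attention maps, flattens the sum, applies the squashing term $\frac{1}{1+e^{(\cdot)}}$, reshapes the result, and multiplies it with a feature image. It is a sketch under several assumptions that the text does not state: each layer map is taken to be already aggregated over channels and resized to a common $H \times W$ shape, $\odot$ is read as element-wise multiplication, and the function names are hypothetical.

```python
import numpy as np

def spatial_attention_matrix(layer_maps):
    """Compute M = R(1 / (1 + exp(F(sum_j O^j)))).

    layer_maps: list of (H, W) arrays, assumed to share one shape.
    """
    summed = np.sum(np.stack(layer_maps, axis=0), axis=0)  # sum_j O^j, shape (H, W)
    flat = summed.reshape(-1)                              # F(.): flatten
    squashed = 1.0 / (1.0 + np.exp(flat))                  # identical to sigmoid(-flat)
    return squashed.reshape(summed.shape)                  # R(.): reshape back to (H, W)

def generate_trigger(feature_image, layer_maps):
    """t = f * M, reading the product as element-wise (an assumption)."""
    return feature_image * spatial_attention_matrix(layer_maps)
```

Because the trigger depends on both the feature image and the attention maps of the model, two different clean images generally yield two different triggers in this sketch.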

Subsequently, we concatenate the clean image $x_{i}$ with its corresponding trigger pattern $t_{i}$ and pass the result to the Injection Model to obtain the poisoned image $\widehat{x_{i}}$. The mathematical representation of this process is:

$$\widehat{x_{i}} = I\left(x_{i} \oplus t_{i}\right)$$

where $I\left(\cdot\right)$ refers to the injection function, and $\oplus$ represents the concatenation operation.

In order to maintain the stealthiness of the backdoored image and minimize the feature loss of the clean image and its trigger, we employ an Extraction Model to recover the trigger pattern from the poisoned image produced by the Injection Model. This yields a trigger image reconstructed from the poisoned image, which can be expressed by the following equation:

$$\widehat{t_{i}} = E\left(\widehat{x_{i}}\right)$$

where $E\left(\cdot\right)$ represents the extraction function and $\widehat{t_{i}}$ denotes the reconstructed trigger image. We design a trigger injection network based on the U-Net architecture, which uses skip connections to compensate for the loss of features during the injection process. The output of the Injection Model is adjusted to have the same dimensions as the clean image after up-sampling and down-sampling of the concatenated clean image and trigger image.
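The shape bookkeeping of such an injection network can be traced with a toy NumPy walkthrough: concatenate image and trigger along the channel axis, down-sample, up-sample, merge with a skip connection, and project back to the channel count of the clean image. This is only an illustrative sketch with random, untrained weights. The layer sizes, the use of 1x1 convolutions, and all function names are assumptions, not the architecture used in the paper.

```python
import numpy as np

def avg_pool2(x):
    """Down-sample a (C, H, W) array by 2 with 2x2 average pooling."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample2(x):
    """Up-sample a (C, H, W) array by 2 with nearest-neighbour repetition."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def conv1x1(x, weight):
    """1x1 convolution: mix channels with a (C_out, C_in) weight matrix."""
    return np.einsum("oc,chw->ohw", weight, x)

def toy_injection(clean, trigger, rng):
    """Toy stand-in for the injection function I applied to the concatenation."""
    x = np.concatenate([clean, trigger], axis=0)                      # concatenation, (6, H, W)
    enc = np.maximum(conv1x1(x, rng.normal(size=(8, x.shape[0]))), 0) # encoder features, (8, H, W)
    bottleneck = avg_pool2(enc)                                       # down-sampling, (8, H/2, W/2)
    dec = upsample2(bottleneck)                                       # up-sampling, (8, H, W)
    merged = np.concatenate([dec, enc], axis=0)                       # skip connection, (16, H, W)
    return conv1x1(merged, rng.normal(size=(clean.shape[0], 16)))     # back to (3, H, W)

rng = np.random.default_rng(0)
clean = rng.random((3, 32, 32))
trigger = rng.random((3, 32, 32))
poisoned = toy_injection(clean, trigger, rng)
print(poisoned.shape)  # same shape as the clean image
```

The skip connection is the step where encoder features are concatenated back in after up-sampling, so that detail lost by pooling remains available to the final projection.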

3.4 Loss Function

We aim to achieve invisible hiding of the trigger pattern in the clean image by optimizing the loss between the clean and poisoned image. Additionally, we introduce a trigger loss to maintain the trigger features during the injection process. The overall loss function is given as:

$$\mathcal{L} = \lambda_{1}\mathcal{L}_{in} + \lambda_{2}\mathcal{L}_{ex}$$

Here, $\mathcal{L}_{in}$ denotes the loss between the clean image and the poisoned image, while $\mathcal{L}_{ex}$ represents the loss between the original trigger image $t_{i}$ and the reconstructed trigger image $\widehat{t_{i}}$. The hyper-parameters $\lambda_{1}$ and $\lambda_{2}$ are used to control the contribution of each loss term. Fig. 3 presents the complete structure of our architecture.
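A minimal sketch of the weighted objective follows. The text does not state which distance is used for $\mathcal{L}_{in}$ and $\mathcal{L}_{ex}$, so mean squared error is assumed here purely for illustration. The default weights 0.5 and 1.0 are the values of $\lambda_{1}$ and $\lambda_{2}$ reported in the experiment setup, and the function names are hypothetical.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean((a - b) ** 2))

def total_loss(clean, poisoned, trigger, recovered_trigger,
               lambda_in=0.5, lambda_ex=1.0):
    """L = lambda_1 * L_in + lambda_2 * L_ex, with MSE assumed for both terms."""
    loss_in = mse(clean, poisoned)             # clean image vs. poisoned image
    loss_ex = mse(trigger, recovered_trigger)  # original vs. reconstructed trigger
    return lambda_in * loss_in + lambda_ex * loss_ex
```

Raising `lambda_in` relative to `lambda_ex` would favor a poisoned image closer to the clean one, at the cost of a less faithfully recoverable trigger.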

4 Experiment

The experimental setup is explained in this section, followed by the assessment of the effectiveness of our attack. We also analyze the robustness, invisibility, and poison rate impact of SATBA.

4.1 Experiment Setup

In our experiments, we evaluated the performance of our attack on three standard datasets: MNIST[7], CIFAR10[17], and GTSRB[33] using three popular deep models: AlexNet[18], VGG16[31], and ResNet18[13]. We resized all images in the datasets to $32 \times 32 \times 3$ and normalized them to [0, 1]. To generate the poisoned datasets $D_{poisoned}$, we randomly selected image samples from each class for the three datasets using a poison rate of $\eta=0.1$. The clean images were then replaced with their corresponding poisoned samples. Table 1 provides additional details about the datasets used in our experiment.

During training of the victim model, we set the learning rate to 0.1 and schedule it to decrease by a factor of 0.1 every 50 epochs, using the SGD[29] optimizer for 200 epochs. To achieve a balance between the Injection Model and the Extraction Model, we found that the poisoned image and the trigger image perform well if we set $\lambda_{1}$ and $\lambda_{2}$ to 0.5 and 1.0, respectively. The trigger injection and extraction networks are trained using the Adam[16] optimizer for 150 epochs, with a learning rate of 0.001. The learning rate is decreased by a factor of 0.5 whenever the validation loss of the model has not decreased in the previous 3 epochs.
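The two learning-rate schedules described above can be sketched in plain Python. `step_lr` mirrors the step decay used for the victim model, and `PlateauHalver` is a hypothetical helper that halves the rate after three epochs without improvement in validation loss. The exact plateau rule (strict improvement, counter reset after each reduction) is an interpretation of the text, not a confirmed detail.

```python
def step_lr(epoch, base_lr=0.1, gamma=0.1, step_size=50):
    """Learning rate at a given epoch: multiplied by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)

class PlateauHalver:
    """Multiply lr by `factor` after `patience` epochs without a lower validation loss."""

    def __init__(self, lr=0.001, factor=0.5, patience=3):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            # Improvement: remember it and reset the counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                # No improvement for `patience` epochs: reduce the rate.
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr
```

Under the step schedule, the 200 training epochs of the victim model pass through four rate levels, starting at 0.1 and shrinking tenfold at epochs 50, 100, and 150.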

4.2 Attack Performance

The performance of our attack is evaluated using Attack Success Rate (ASR) and Clean Data Accuracy (CDA). ASR measures the proportion of test samples with triggers that are predicted as the target label $y_{target}$, while CDA indicates the accuracy of the infected model on the clean test dataset. We evaluate the effectiveness of our proposed SATBA attack by comparing it with four conventional backdoor attacks, namely Badnets[11], Blend[6], Refool[21], and Wanet[24]. To ensure a fair evaluation, we adopt an All-to-One attack strategy in which all poisoned images are labeled with the same target label $y_{target}$ (class 0). The results are presented in Table 2, which reports the ASR and CDA of the different backdoor attacks on the three standard image classification datasets and DNNs. Our SATBA attack successfully poisons deep models by poisoning only a small portion of the training set and achieves a higher ASR compared to other backdoor attacks. Specifically, SATBA achieves the highest ASR on CIFAR10 while preserving a high CDA, surpassing the results of Badnets[11], Blend[6], Refool[21], and Wanet[24]. For GTSRB, our proposed attack performs well on Alexnet, VGG16, and Resnet18, obtaining ASR scores comparable to the best result. Moreover, on MNIST, our approach outperforms others in both CDA and ASR. Meanwhile, the SATBA backdoor attack does not cause a substantial decrease in the validation accuracy of the infected model on clean datasets, and even shows improvement in some cases. While SATBA's ASR and CDA may not always significantly exceed those of other attacks, they are sufficient to conduct a backdoor attack against the victim model.
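The two metrics can be computed from model predictions as in the sketch below. The function names are hypothetical, and the option to exclude test samples whose true class already equals the target is a common convention assumed here, since the text does not say whether such samples are counted.

```python
import numpy as np

def clean_data_accuracy(pred_clean, labels):
    """CDA: fraction of clean test samples classified correctly."""
    return float(np.mean(np.asarray(pred_clean) == np.asarray(labels)))

def attack_success_rate(pred_triggered, true_labels, target_label,
                        exclude_target_class=True):
    """ASR: fraction of triggered test samples predicted as the target label."""
    pred = np.asarray(pred_triggered)
    true = np.asarray(true_labels)
    if exclude_target_class:
        # Assumption: samples already belonging to the target class are skipped.
        mask = true != target_label
    else:
        mask = np.ones(true.shape, dtype=bool)
    return float(np.mean(pred[mask] == target_label))
```

With the All-to-One strategy, `target_label` would be 0 and `pred_triggered` would hold the predictions of the infected model on the test images after trigger injection.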

4.3 Defense Resistance

We assessed the resistance of our proposed SATBA attack to backdoor defense using the Neural Cleanse (NC)[36] method. NC generates potential triggers for each class of the model being tested and calculates an Anomaly Index for them, with a higher Anomaly Index indicating a greater likelihood of a backdoor being embedded in the DNN. NC considers a deep model to contain a backdoor when its Anomaly Index exceeds the threshold of 2. As depicted in Fig. 4, NC was unable to detect the backdoored model injected by SATBA, confirming its ability to evade backdoor defense. Furthermore, our model had a lower Anomaly Index compared to other DNNs trained using common backdoor attacks, indicating that SATBA has greater resilience to backdoor defense.
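For intuition, the sketch below computes an anomaly index from per-class trigger sizes using a median absolute deviation (MAD) outlier rule, which is how Neural Cleanse is generally described. The text here does not spell out the computation, so the scaling constant 1.4826, the use of L1 norms of the reverse-engineered triggers, and the choice of the smallest norm as the model-level score should all be read as assumptions.

```python
import numpy as np

def model_anomaly_index(trigger_l1_norms):
    """Anomaly index of the smallest reverse-engineered trigger (MAD-based)."""
    norms = np.asarray(trigger_l1_norms, dtype=np.float64)
    median = np.median(norms)
    # 1.4826 makes the MAD a consistent estimator of the standard deviation
    # for normally distributed data.
    mad = 1.4826 * np.median(np.abs(norms - median))
    if mad == 0:
        raise ValueError("MAD is zero; the anomaly index is undefined")
    return float(np.abs(np.min(norms) - median) / mad)

def is_flagged(trigger_l1_norms, threshold=2.0):
    """A model is flagged as backdoored when the index exceeds the threshold."""
    return model_anomaly_index(trigger_l1_norms) > threshold
```

The idea is that a backdoored class can usually be reached with an unusually small trigger, so its norm stands out as an outlier among the per-class norms.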

4.4 Stealthiness Analysis

Fig. 5 presents a visual comparison of poisoned images and their triggers generated by different backdoor attack methods on the GTSRB dataset. In contrast to Badnets[11], Blend[6], Refool[21], and Wanet[24], the poisoned image created by SATBA appears more natural and closely resembles the clean image, making it less detectable by humans. Moreover, its corresponding trigger is more relevant to the clean image and is imperceptible, which is crucial for ensuring attack stealthiness.

To quantitatively evaluate the similarity between the clean image and the poisoned image generated by different attacks, we measure the peak signal-to-noise ratio (PSNR)[14], structural similarity index (SSIM)[39], mean square error (MSE), and learned perceptual image patch similarity (LPIPS)[41]. LPIPS measures similarity based on features learned by a pretrained Alexnet, while PSNR, SSIM and MSE compute similarity based on pixel-level statistics. The stealthiness metrics we use are related to the degree of similarity between the clean and poisoned images. Specifically, higher PSNR and SSIM scores indicate greater similarity, while lower MSE and LPIPS scores suggest better invisibility of the poisoned image. We conducted experiments to evaluate the stealthiness of SATBA on the MNIST, CIFAR10, and GTSRB datasets by randomly selecting 1000 images from the poisoned test set. As shown in Table 3, our proposed attack achieved excellent scores in all similarity metrics, including the highest PSNR and lowest MSE values for all three datasets. Although the SSIM of Badnets was better than ours, our attack was very close to the best result. In terms of LPIPS, our SATBA showed significant improvement on MNIST and GTSRB compared to Badnets, Blend, Refool, and Wanet. Notably, while the LPIPS of Badnets was lower than that of SATBA on CIFAR10, our attack achieved the second-best result and was the closest to Badnets among all attacks.
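Two of these metrics are simple enough to state directly. The sketch below computes MSE and PSNR for a pair of images, assuming 8-bit pixel values with a peak of 255. The peak value and the function names are assumptions, since the text does not say on which scale the reported numbers were computed.

```python
import numpy as np

def mse(clean, poisoned):
    """Mean squared error over all pixels and channels."""
    a = np.asarray(clean, dtype=np.float64)
    b = np.asarray(poisoned, dtype=np.float64)
    return float(np.mean((a - b) ** 2))

def psnr(clean, poisoned, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(clean, poisoned)
    if err == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / err))
```

Because PSNR is a decreasing function of MSE, an attack with a lower pixel-level error necessarily scores higher on PSNR for the same image pair.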

4.5 Poison Rate

To examine how the poison rate affects the attack success rate, we compared the performance of our attack on different datasets with Resnet18. The results, as presented in Fig. 6, demonstrate that our attack achieves a high attack success rate while maintaining a stable test accuracy on all three standard datasets. Specifically, with only 1% of training images poisoned, SATBA achieves nearly 100% ASR on MNIST. For CIFAR10 and GTSRB, the proposed attack performs well, with ASR greater than 0.92 when the poison rate is over 0.02. Additionally, the victim model's Clean Data Accuracy (CDA) remains in a normal range, in some cases even higher than that of the clean model, with no distinguishable difference from a clean DNN (less than 0.04). This experiment validates the effectiveness of SATBA without sacrificing the accuracy of the poisoned model on the clean dataset.

5 Conclusion

This paper presents a new technique for creating invisible backdoor attacks on deep neural networks (DNNs). Our approach involves using spatial attention to identify the focus area of a victim model on clean data and generating a unique trigger corresponding to that sample. A U-type model is then employed to produce poisoned images while optimizing the feature loss of both the images and triggers. Our experimental results show that our proposed method is highly effective, generating imperceptible poisoned images that are able to successfully attack DNNs. We believe that our work can aid in the advancement of more robust and secure image classification neural networks.

Our future work will focus on investigating the transferability of our trigger, examining whether a trigger generated from one dataset-DNN pair can successfully attack other models and triggers. Additionally, we intend to further optimize the performance of our approach by incorporating resnet blocks into our trigger injection network, thus enhancing the feature extraction capabilities of our model.

Bibliography

[1] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
[2] Barni, M., Kallas, K., Tondi, B.: A new backdoor attack in cnns by training set corruption without label poisoning. In: 2019 IEEE International Conference on Image Processing (ICIP). pp. 101–105. IEEE (2019)
[3] Cao, H., Wang, Y., Chen, J., Jiang, D., Zhang, X., Tian, Q., Wang, M.: Swin-unet: Unet-like pure transformer for medical image segmentation. In: Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III. pp. 205–218. Springer (2023)
[4] Chan, A.: Gpt-3 and instructgpt: technological dystopianism, utopianism, and "contextual" perspectives in ai ethics and industry. AI and Ethics pp. 1–12 (2022)
[5] Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A.L., Zhou, Y.: Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306 (2021)
[6] Chen, X., Liu, C., Li, B., Lu, K., Song, D.: Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526 (2017)
[7] Deng, L.: The mnist database of handwritten digit images for machine learning research [best of the web]. IEEE signal processing magazine 29(6), 141–142 (2012)
[8] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)