ArtiFact: A Large-Scale Dataset with Artificial and Factual Images for   Generalizable and Robust Synthetic Image Detection

Md Awsafur Rahman; Bishmoy Paul; Najibul Haque Sarker; Zaber Ibn Abdul; Hakim; Shaikh Anowarul Fattah

arXiv:2302.11970·cs.CV·February 27, 2023

ArtiFact: A Large-Scale Dataset with Artificial and Factual Images for Generalizable and Robust Synthetic Image Detection

Md Awsafur Rahman, Bishmoy Paul, Najibul Haque Sarker, Zaber Ibn Abdul, Hakim, Shaikh Anowarul Fattah

PDF

Open Access 1 Repo

TL;DR

This paper introduces ArtiFact, a comprehensive large-scale dataset with diverse synthetic and real images, and proposes a robust detection method that generalizes well to unseen generators and real-world impairments.

Contribution

The paper presents ArtiFact, a new large-scale dataset for synthetic image detection, and a multi-class classification approach with a filter stride reduction strategy to improve generalization.

Findings

01

Outperforms other methods by up to 15.08% in accuracy on challenging tests.

02

Effectively detects synthetic images from unseen generators.

03

Robust against real-world impairments and social platform distortions.

Abstract

Synthetic image generation has opened up new opportunities but has also created threats in regard to privacy, authenticity, and security. Detecting fake images is of paramount importance to prevent illegal activities, and previous research has shown that generative models leave unique patterns in their synthetic images that can be exploited to detect them. However, the fundamental problem of generalization remains, as even state-of-the-art detectors encounter difficulty when facing generators never seen during training. To assess the generalizability and robustness of synthetic image detectors in the face of real-world impairments, this paper presents a large-scale dataset named ArtiFact, comprising diverse generators, object categories, and real-world challenges. Moreover, the proposed multi-class classification scheme, combined with a filter stride reduction strategy addresses social…

Tables3

Table 1. Table 1 : Ablation study of the proposed method.

Method	Accuracy
Binary-class	78.21
Binary-class + FSR	81.30
Multi-class	83.12
Multi-class + UF class	84.98
Multi-class + FSR	85.56
Multi-class + FSR + UF class	87.62

Table 2. Table 2 : Accuracy (%) of Top3 Teams in IEEE VIP Cup 2022

Team Names	Test 1	Test 2	Test 3
Sherlock	87.70	77.52	73.45
FAU Erlangen-Nürnberg	87.14	81.74	75.52
Megatron (Ours)	96.04	83.00	90.60

Table 3. Table 3 : Comparison of the proposed method with existing approaches.

Method	Accuracy
Joel et al. [3]	63.19
Francesco et al. [6]	79.28
Wang et al. [11]	79.95
Gragnaniello et al. [12]	81.63
Multi-class + FSR + UF class (ours)	87.62

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

awsaf49/artifact
pytorchOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsDigital Media Forensic Detection · Generative Adversarial Networks and Image Synthesis · Advanced Steganography and Watermarking Techniques

MethodsTest · Convolution

Full text

1 Dept. of EEE, BUET, Bangladesh

2 Dept. of CSE, BUET, Bangladesh

ArtiFact: A Large-Scale Dataset with Artificial and Factual Images for Generalizable and Robust Synthetic Image Detection

Abstract

Synthetic image generation has opened up new opportunities but has also created threats in regard to privacy, authenticity, and security. Detecting fake images is of paramount importance to prevent illegal activities, and previous research has shown that generative models leave unique patterns in their synthetic images that can be exploited to detect them. However, the fundamental problem of generalization remains, as even state-of-the-art detectors encounter difficulty when facing generators never seen during training. To assess the generalizability and robustness of synthetic image detectors in the face of real-world impairments, this paper presents a large-scale dataset111The dataset is available at https://github.com/awsaf49/artifact named ArtiFact, comprising diverse generators, object categories, and real-world challenges. Moreover, the proposed multi-class classification scheme, combined with a filter stride reduction strategy addresses social platform impairments and effectively detects synthetic images from both seen and unseen generators. The proposed solution significantly outperforms other top teams by 8.34% on Test 1, 1.26% on Test 2, and 15.08% on Test 3 in the IEEE VIP Cup challenge at ICIP 2022, as measured by the accuracy metric.

§§footnotetext: Equal contribution

**Index Terms— ** Synthetic Image, Robust Detection, Multi-class classification, Generative Models

1 Introduction

With the advent of deep learning technologies, an array of new methods have been introduced for synthetic image generation. These improvements have opened up new and exciting opportunities in creative arts, the entertainment industry, and advertising. But they are also posing a threat in regard to privacy, authenticity, and security—for example, generating fake images of a subject in different contexts.

To prevent these types of illegal and detrimental activities, developing technologies to detect these fake images have paramount importance. Most generative architectures have unique patterns in their synthetic images, which are absent in real images and vary with both the generator architecture and the dataset it is trained with [1]. Recent works on synthetic image detection exploited these artifacts present in generators utilizing color band correlation, intensity, Fourier spectra characteristics [2, 3, 4, 5]. For the detection of these patterns and artifacts, methods based on hand-crafted features and frequency domain analysis are outperformed by very deep CNN models [6].

The most fundamental problem that still remains is generalization, where even state-of-the-art detectors encounter difficulty when facing generators never seen during training [5]. With the rapid development of sophisticated generative models, it is impossible to include all possible variants in a detector’s training set. This is further compounded by the prevalence of image impairment in compressed and resized social media images, which are particularly vulnerable to synthetic image-related fraudulent activities. The development of a viable solution is hindered by the lack of a proper benchmark dataset as existing relevant datasets are limited by a lack of diversity among generator sources and object classes. This necessitates the creation of a dataset featuring both real and fake images from diverse generators and object categories, with real-world challenges.

Given the foregoing context, this study provides the following significant contributions: 1) A large-scale dataset namely ArtiFact is proposed, replete with diverse generators, object categories, and real-world impairments, poised to assess the efficacy of synthetic image detectors across a vast spectrum of sources and categories. 2) To address the generalizability problem of detecting fake images from previously unseen generators, while also addressing real-world impairments that can impact the robustness of image detectors, a multi-class classification scheme along with filter stride reduction strategy is proposed for both generalizable and robust synthetic image detection.

2 Related Work

Deep learning-based generative models have revolutionized image synthesis in recent years. After the breakthrough in image synthesis by generative adversarial networks(GANs), the basic GAN framework has been extended for more diverse generations [7, 8]. Recent diffusion models [9] have shown impressive results in generating high-quality images with realistic textures and details.

Generative models have unique patterns that can be used to attribute them [10]. Despite ongoing efforts to reduce these patterns in generators, even recent proposals such as Diffusion Models are not free from them [5]. Color band inconsistencies [2] or lack of variation in color intensity [4] can act as unique identifiers for generative models. Fourier domain analysis also reveals unique signatures of generative models that can be used for model attribution [3]. Marra et al. [6] shows that pre-trained state-of-the-art CNN image classification models outperform CNN models specifically designed from scratch for this task. Recent works on synthetic model attribution utilize supervised training of pre-trained CNN models with augmented data [11, 12]. Although they have shown promising results in detecting generators whose images are included in the training dataset, these methods have difficulty generalizing to unseen generators and also are susceptible to image impairments.

Existing datasets for this task are either limited in the number of diverse generators or object categories. For example, a dataset comprising five classes, including one real and four synthetic classes, is introduced by [10]. However, this dataset only includes GAN models without any accompanying image data. Bui et al. [13] introduces a dataset with ten classes, comprising eight generator classes and two real classes. But this dataset still suffers from a lack of diversity of generators and object categories. Therefore, the need for more diverse and comprehensive datasets for synthetic image detection persists.

3 Methodology

3.1 Proposed ArtiFact Dataset

The challenge of evaluating the performance of synthetic image detectors in terms of their generalizability and robustness requires a comprehensive dataset that meets specific requirements. These requirements include 1) a diversity of generators, including GAN, Diffusion, fully manipulating, and partially manipulating, 2) a diversity of object categories, encompassing various types rather than a few types, and 3) reflecting real-world scenarios by incorporating impairments resulting from social platforms. However, current datasets lack these features, limiting the ability to evaluate the detectors fully, and providing only a partial view of their robustness and generalizability. To address this issue, a large-scale dataset namely ArtiFact (Artificial and Factual), has been proposed which integrates diverse generators, object categories, and real-world impairments, providing researchers with a more comprehensive understanding of synthetic image detectors’ generalizability and robustness.

3.1.1 Dataset Characteristics

To include a diverse collection of real images from multiple categories, including Human/Human Faces, Animal/Animal Faces, Places, Vehicles, Art, and many other real-life objects, the proposed dataset utilizes 8 sources [7, 14, 15, 16] that are carefully chosen. Additionally, to inject diversity in terms of generators, the proposed dataset synthesizes images from 25 distinct methods [7, 8, 9, 14, 17, 18, 16, 19, 15, 20, 21, 22, 23, 24]. Specifically, it includes 13 GANs, 7 Diffusion, and 5 other miscellaneous generators. On the other hand, in terms of syntheticity, there are 20 fully manipulating and 5 partially manipulating generators, thus providing a broad spectrum of diversity in terms of generators used. The distribution of real and fake data with different sources is shown in Fig.1 and Fig.2, respectively. The dataset contains a total of 2,496,738 images, comprising 964,989 real images and 1,531,749 fake images. The most frequently occurring categories in the dataset are Human/Human Faces, Animal/Animal Faces, Vehicles, Places, and Art.

3.1.2 Dataset Methodology

To ensure significant diversity across different sources, the real images of the dataset are randomly sampled from source datasets containing numerous categories, whereas synthetic images are generated within the same categories as the real images. Captions and image masks from the COCO dataset are utilized to generate images for text2image and inpainting generators, while normally distributed noise with different random seeds is used for noise2image generators. In both cases, the generator’s default configuration is employed to produce images. To ensure that the proposed dataset reflects real-world scenarios, both real and synthetic images of the dataset undergo different impairments in accordance with the IEEE VIP Cup 2022 standards [5]. Specifically, random cropping with a ratio of $r=\frac{5}{8}$ and minimum and maximum crop sizes of $160$ and $2048$ , respectively, are applied to the images. They are then resized to $200\times 200$ before being compressed using the JPEG format with quality $Q_{f}\in[65,100]$ . Thereby, the proposed dataset accurately reflects real-world conditions, making it an ideal benchmark for evaluating the performance of synthetic image detectors.

3.2 Proposed Detection Scheme

Current approaches struggle with both unseen generators and impairments. Therefore, a robust and generalizable detector is necessary which is 1) capable of generalizing for both seen and unseen generators, with or without impairments, and 2) do not lose critical information by preserving generator traces in presence of impairments. To meet these requirements, an effective method has been proposed utilizing Multi-class Scheme (Section 3.2.1) to tackle the first challenge and Filter Stride Reduction (FSR) (Section 3.2.2) to handle the second challenge. As depicted in Fig. 3, the proposed method utilizes real and fake images from seen generators for training and testing, with fake images from unseen generators used solely as test data for predicting binary levels of authenticity in multi-class settings.

3.2.1 Multi-class Classification Scheme

This paper introduces a novel multi-class classification scheme to enhance the generalization and robustness of synthetic image detection. As depicted in Figure 4, the traditional binary classification problem of distinguishing real and fake images is transformed into a multi-class classification task with seven classes in accordance with IEEE VIP Cup [5]. Specifically, the proposed approach includes one class for identifying real images, five classes for identifying fake images from five seen generators (generators whose synthesized images are present in the training dataset), and one unique class, namely Unseen Fake (UF) for identifying fake images from unseen generators (data from generators that are not present during training). Traditional approaches are inadequate to account for these UF images. However, when the proposed method encounters images from previously unseen generators, the learning from the diverse UF class aids in labeling them as fake. Besides, the multi-class approach leverages the notion that increasing the number of classes can significantly improve the performance [25] as it exposes the model to a diverse range of visual concepts and relationships between classes, leading to the learning of more generalized and robust features. The proposed method generates multi-class predictions but the evaluation metric expects binary-class, thus multi-class prediction is converted to binary-class taking the complement of real-class prediction.

3.2.2 Filter Stride Reduction (FSR)

The utilization of social media networks often results in the resizing and compression of images, which can result in the loss of critical information causing harm to invaluable traces of generator artifacts. This problem is exacerbated when the image is processed by modern CNNs and Vision Transformers, where the reduction of resolution in the stem block can further damage crucial information. To tackle this problem, a novel approach is put forth, which reduces the filter stride in the stem block of the ConvNeXt [26] backbone by $2\times$ as shown in Fig. 4. This approach enables the reduction of information loss and preservation of generator artifacts while keeping architectural integrity and utilizing pre-trained weights, thus resulting in substantial improvement in performance.

4 Experiments

4.1 Experimental Setup

The proposed approach employs ConvNext-Large [26] as the backbone and a resolution of $200\times 200$ , with the Adam optimizer and Exponentially Decay scheduler with an initial learning rate of $10^{-4}$ . Furthermore, it utilizes Categorical Cross Entropy loss with label smoothing, with $\varepsilon=5\times 10^{-2}$ . To evaluate the detector’s generalizability reliably, a four-fold hybrid cross-validation scheme is implemented with balanced accuracy which applies KFold [27] for real class and five seen fake classes where data from the same generators can appear in both the train and test sets as well as GroupKFold [27] for the unseen fake class where there is no overlap between generators in the train and test sets. To prevent overfitting, various augmentations are also employed randomly, including scale-shift-rotate-shear, contrast-brightness-hue, flips, and cutout.

4.2 Ablation Study

An ablation study is conducted to evaluate the proposed method’s effectiveness in terms of balanced accuracy, and the results are presented in Table 1. The table shows that the Multi-class scheme with Filter Stride Reduction (FSR) and an Unseen Fake (UF) class achieved the highest accuracy of 87.62%. This approach outperformed the traditional Binary-class method by 9.41%, demonstrating the efficacy of the proposed method.

4.3 Result on IEEE VIP Cup 2022

The performance of the proposed method is evaluated in IEEE VIP Cup [5] competition at ICIP 2022 using a small portion of the proposed ArtiFact dataset, totaling 222K images of 71K real images and 151K fake images from only 13 generators. As shown in Table 2, the proposed method consistently outperforms other top teams on the leaderboard by a significant margin, with an improvement of 8.34% on Test 1, 1.26% on Test 2, and 15.08% on Test 3, as measured by the accuracy metric, thus validating the efficacy of the proposed method. It is important to note that the Test data is kept confidential from all participating teams. Additionally, the generators used for the Test 1 data are known to all teams, whereas the generators for Test 2 and Test 3 are kept undisclosed. Given that all the teams sourced train data from various sources, there is a possibility of overlap between generators used in Test 2 and Test 3 data.

4.4 Comparison with existing approaches

A quantitative comparison of the proposed method with existing techniques has been presented in Table 3 which clearly demonstrates that the proposed method exhibits exceptional performance, surpassing other methods by a significant margin in terms of balanced accuracy. These findings conclusively show the effectiveness of the proposed method in the field of synthetic image detection.

5 Conclusion

In this study, a novel dataset consisting of 2.4M images for fake image detection containing diverse generators, and object categories with real-world scenarios has been presented, along with a robust methodology that is immune to social-platform impairments and attacks from unseen generators. The proposed multi-class scheme with a dedicated class for unseen generators demonstrates exceptional performance. Additionally, the proposed filter stride reduction effectively combats the loss of critical information caused by social-platform impairments. Thus, the proposed solution is a promising contender in the field of synthetic image detection, with the potential to address crucial forensic issues related to synthetic images.

6 Acknowledgment

The authors would like to express their gratitude to the IEEE Signal Processing Society, GRIP of the University Federico II of Naples (Italy), and NVIDIA (USA) for hosting the IEEE VIP Cup competition which acted as motivation for this work.

Bibliography27

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] Francesco Marra “Do gans leave artificial fingerprints?” In 2019 IEEE Conf. on Multimedia Information Processing and Retrieval (MIPR) , 2019, pp. 506–511 IEEE
2[2] Haodong Li and Bin Li “Identification of deep network generated images using disparities in color components” In Signal Processing 174 Elsevier, 2020, pp. 107616
3[3] Joel al “Leveraging frequency analysis for deep fake image recognition” In Int. Conf. on Machine Learning , 2020, pp. 3247–3258 PMLR
4[4] Scott Mc Closkey “Detecting GAN-generated imagery using saturation cues” In 2019 IEEE Int. Conf. on Image Processing (ICIP) , 2019, pp. 4584–4588 IEEE
5[5] Riccardo Corvi and Davide Cozzolino “On the detection of synthetic images generated by diffusion models” ar Xiv, 2022 DOI: 10.48550/ARXIV.2211.00680 · doi ↗
6[6] Francesco Marra “Detection of gan-generated fake images over social networks” In 2018 IEEE Conf. on Multimedia Information Processing and Retrieval (MIPR) , 2018, pp. 384–389 IEEE
7[7] Weihao Xia and Yulun Zhang “Gan inversion: A survey” In IEEE Trans. PAMI IEEE, 2022
8[8] Guillermo Iglesias, Edgar Talavera and Alberto Díaz-Álvarez “A survey on GA Ns for computer vision: Recent research, analysis and taxonomy” In ar Xiv preprint ar Xiv:2203.11242 , 2022