TL;DR
This paper introduces a new dataset of eye movements on transformed images, analyzes gaze behavior changes, and develops GazeGAN, a novel GAN-based saliency model that leverages data augmentation and multi-level features for improved robustness and performance.
Contribution
The study creates a new transformed image saliency dataset, analyzes gaze shifts due to transformations, and proposes GazeGAN, a GAN-based model with novel features and loss functions for superior saliency prediction.
Findings
Label-preserving transformations improve saliency prediction.
Certain transformations degrade model performance when not label-preserving.
GazeGAN outperforms existing models and is robust to image perturbations.
Abstract
Data size is the bottleneck for developing deep saliency models, because collecting eye-movement data is very time consuming and expensive. Most of current studies on human attention and saliency modeling have used high quality stereotype stimuli. In real world, however, captured images undergo various types of transformations. Can we use these transformations to augment existing saliency datasets? Here, we first create a novel saliency dataset including fixations of 10 observers over 1900 images degraded by 19 types of transformations. Second, by analyzing eye movements, we find that observers look at different locations over transformed versus original images. Third, we utilize the new data over transformed images, called data augmentation transformation (DAT), to train deep saliency models. We find that label preserving DATs with negligible impact on human gaze boost saliency…
| Transformations | Generation code (using Matlab) | IO scores: sAUC |
| Reference | 100 distortion-free images (img) from CAT2000 | 0.733 |
| MotionBlur1 | imfilter(img, fspecial(‘motion’, 15, 0)) | 0.664 |
| MotionBlur2 | imfilter(img, fspecial(‘motion’, 35, 90)) | 0.651 |
| Noise1 | imnoise(img, ‘gaussian’, 0, 0.1) | 0.706 |
| Noise2 | imnoise(img, ‘gaussian’, 0, 0.2) | 0.696 |
| JPEG1 | imwrite(img, saveroutine, ‘Quality’, 5) | 0.703 |
| JPEG2 | imwrite(img, saveroutine, ‘Quality’, 0) | 0.705 |
| Contrast1 | imadjust(img, [ ], [0.3, 0.7]) | 0.722 |
| Contrast2 | imadjust(img, [ ], [0.4, 0.6]) | 0.702 |
| Rotation1 | imrotate(img, -45, ‘bilinear’, ‘loose’) | 0.680 |
| Rotation2 | imrotate(img, -135, ‘bilinear’, ‘loose’) | 0.654 |
| Shearing1 | imwarp(img, affine2d([1 0 0; 0.5 1 0; 0 0 1]) | 0.711 |
| Shearing2 | imwarp(img, affine2d([1 0.5 0; 0 1 0; 0 0 1]) | 0.687 |
| Shearing3 | imwarp(img, affine2d([1 0.5 0; 0.5 1 0; 0 0 1]) | 0.665 |
| Inversion | imrotate(img, -180, ‘bilinear’, ‘loose’) | 0.695 |
| Mirroring | mirror symmetry version of reference images | 0.726 |
| Boundary | edge(img, ‘canny’, 0.3, sqrt(2)) | 0.667 |
| Cropping1 | a band from the left of img | 0.697 |
| Cropping2 | a band from the top side of img | 0.692 |
| Dataset | Loss functions | CC | NSS | sAUC | KL |
|---|---|---|---|---|---|
| LSUN’17 | + KL + CC + NSS | 0.823 | 1.388 | 0.686 | 0.876 |
| + KL + CC + NSS + | 0.849 | 1.493 | 0.718 | 0.587 | |
| + KL + CC + NSS + HistLoss | 0.855 | 1.557 | 0.712 | 0.606 | |
| + KL + CC + NSS + HistLoss + | 0.881 | 1.911 | 0.738 | 0.373 |
| Dataset | Component Modules | CC | NSS | sAUC | KL |
|---|---|---|---|---|---|
| LSUN’17 | : Plain U-Net | 0.752 | 1.221 | 0.613 | 0.824 |
| : + Residual blocks | 0.849 | 1.472 | 0.689 | 0.530 | |
| : + CSC Module | 0.865 | 1.609 | 0.718 | 0.489 | |
| : + Local Generator | 0.881 | 1.911 | 0.738 | 0.373 |
| sAUC | IG | NSS | CC | AUC | SIM | KL | |
| GazeGAN | 0.736 | 0.720 | 1.899 | 0.879 | 0.864 | 0.773 | 0.376 |
| SAM-ResNet | 0.741 | 0.538 | 1.990 | 0.899 | 0.865 | 0.793 | 0.610 |
| EML-Net | 0.746 | 0.716 | 2.050 | 0.886 | 0.866 | 0.780 | 0.520 |
| DI-Net | 0.739 | 0.195 | 1.959 | 0.902 | 0.862 | 0.795 | 0.864 |
| CEDNS | 0.745 | 0.357 | 2.045 | 0.862 | 0.862 | 0.753 | 1.026 |
| lvjincheng | 0.726 | 0.613 | 1.829 | 0.856 | 0.855 | 0.705 | 0.376 |
| hallazie | 0.724 | 0.640 | 1.804 | 0.844 | 0.855 | 0.714 | 0.381 |
| RyanLui | 0.724 | -0.187 | 1.838 | 0.855 | 0.850 | 0.746 | 1.208 |
| hrtavakoli | 0.717 | 0.541 | 1.773 | 0.848 | 0.845 | 0.684 | 0.492 |
| sfdodge | 0.720 | 0.646 | 1.911 | 0.821 | 0.856 | 0.722 | 0.527 |
| CC | NSS | AUC-Borji | sAUC | SIM | KL | |
| GazeGAN | 0.760 | 2.140 | 0.865 | 0.643 | 0.663 | 0.781 |
| SAM-VGG [6] | 0.753 | 2.134 | 0.859 | 0.612 | 0.668 | 0.831 |
| SAM-ResNet [6] | 0.760 | 2.128 | 0.862 | 0.622 | 0.659 | 0.878 |
| ML-Net [2] | 0.586 | 1.698 | 0.793 | 0.623 | 0.541 | 0.796 |
| OpenSALICON [4] | 0.543 | 1.539 | 0.822 | 0.634 | 0.511 | 0.783 |
| SalGAN [3] | 0.561 | 1.524 | 0.820 | 0.633 | 0.489 | 0.864 |
| Sal-Net [5] | 0.553 | 1.433 | 0.828 | 0.600 | 0.484 | 0.874 |
| GBVS [10] | 0.521 | 1.341 | 0.821 | 0.585 | 0.468 | 0.879 |
| Itti&Koch [9] | 0.439 | 1.118 | 0.783 | 0.582 | 0.430 | 1.021 |
| AUC-Judd | CC | NSS | AUC-Borji | sAUC | SIM | KL | |
| GazeGAN | 0.891 | 0.808 | 2.914 | 0.878 | 0.743 | 0.764 | 0.496 |
| SAM-VGG [6] | 0.879 | 0.756 | 2.900 | 0.850 | 0.712 | 0.722 | 0.545 |
| SAM-ResNet [6] | 0.886 | 0.774 | 2.860 | 0.856 | 0.727 | 0.733 | 0.533 |
| OpenSALICON [4] | 0.886 | 0.748 | 2.823 | 0.833 | 0.726 | 0.720 | 0.516 |
| ML-Net [2] | 0.863 | 0.669 | 2.392 | 0.840 | 0.704 | 0.716 | 0.577 |
| SalGAN [3] | 0.807 | 0.703 | 1.987 | 0.810 | 0.707 | 0.712 | 0.580 |
| Sal-Net [5] | 0.853 | 0.557 | 1.430 | 0.803 | 0.677 | 0.690 | 0.615 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
How is Gaze Influenced by Image
Transformations? Dataset and Model
Zhaohui Che, Ali Borji, Guangtao Zhai, Xiongkuo Min, Guodong Guo, and Patrick Le Callet Manuscript received March 04, 2019; revised August 29, 2019; accepted September 26, 2019. Date of publication XXXX-XXXX, XXXX; date of current version XXXX-XXXX, XXXX. This work was supported by the National Science Foundation of China (61831015, 61771305, 61927809 and 61901260), and in part by the China Postdoctoral Science Foundation under Grants BX20180197 and 2019M651496. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Soma Biswas. (Corresponding author: Guangtao Zhai.)Z. Che, G. Zhai, X. Min are with the Institute of Image Communication and Network Engineering, Shanghai Key Laboratory of Digital Media Processing and Transmissions, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: {chezhaohui,zhaiguangtao,minxiongkuo}@sjtu.edu.cn).A. Borji is a senior research scientist at MarkableAI Inc, Brooklyn, NY 11201, USA (e-mail: [email protected]).G. Guo is with the Institute of Deep Learning, Baidu Research, Beijing 100193, China, and also with the Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV USA (e-mail: [email protected]).P. L. Callet is with the Équipe Image, Perception et Interaction, Laboratoire des Sciences du Numérique de Nantes, Université de Nantes, France (e-mail: [email protected]).Digital Object Identifier XXXXXXXX
Abstract
Data size is the bottleneck for developing deep saliency models, because collecting eye-movement data is very time-consuming and expensive. Most of current studies on human attention and saliency modeling have used high-quality stereotype stimuli. In real world, however, captured images undergo various types of transformations. Can we use these transformations to augment existing saliency datasets? Here, we first create a novel saliency dataset including fixations of 10 observers over 1900 images degraded by 19 types of transformations. Second, by analyzing eye movements, we find that observers look at different locations over transformed versus original images. Third, we utilize the new data over transformed images, called data augmentation transformation (DAT), to train deep saliency models. We find that label-preserving DATs with negligible impact on human gaze boost saliency prediction, whereas some other DATs that severely impact human gaze degrade the performance. These label-preserving valid augmentation transformations provide a solution to enlarge existing saliency datasets. Finally, we introduce a novel saliency model based on generative adversarial networks (dubbed GazeGAN). A modified U-Net is utilized as the generator of the GazeGAN, which combines classic “skip connection” with a novel “center-surround connection” (CSC) module. Our proposed CSC module mitigates trivial artifacts while emphasizing semantic salient regions, and increases model nonlinearity, thus demonstrating better robustness against transformations. Extensive experiments and comparisons indicate that GazeGAN achieves state-of-the-art performance over multiple datasets. We also provide a comprehensive comparison of 22 saliency models on various transformed scenes, which contributes a new robustness benchmark to saliency community. Our code and dataset are available at: https://github.com/CZHQuality/Sal-CFS-GAN.
Index Terms:
Human Gaze, Saliency Prediction, Data Augmentation, Generative Adversarial Networks, Model Robustness.
I Introduction
Visual attention is a sophisticated mechanism for selecting informative and conspicuous regions from external stimuli [1]. To the best of our knowledge, most of current human attention studies and saliency models are based on stereotype stimuli, *e.g. * distortion-free images and upright scenes. However, most of stimuli in the real physical world are corrupted by diverse transformations.
As an example, we present a practical case in the first column of Fig. 1. When viewing the original canonical image, human attention is highly attracted to the “station board”, because this region provides critical semantic information that helps observers to recognize the scene as a “railway station”. On the other hand, when adding noise to this scene, the “station board” region still attracts most of human attention. However, when the “station board” is cropped, human gaze is significantly changed. We can see that most of human attention transfers to the “advertisement board” and the blurred “metro”, because these salient objects help observers to understand the new transformed scene. These cases raise new concerns about human gaze invariance on transformed scenes.
In the past decades, a plethora of saliency models [2, 3, 4, 5, 6, 1, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21] have been proposed to detect saliency regions, which serve as an efficient front-end process to complex vision tasks such as scene understanding and object recognition [22, 23, 24].
Despite their great successes in stereotype clean stimulus, most of current saliency models, either recent deep models or early hand-crafted models, are vulnerable to transformations. As shown in the second column of Fig. 1, the SALICON [4] model is susceptible to noise artifacts, and produces severe false positives such as “hand”, also misses important true positives like “face”. Therefore, it is important to investigate new robust approaches to reach the human level accuracy on transformed scenes.
Some related works regard human attention over transformed conditions. Kim *et al. * [25] investigated visual saliency over noisy images and proposed a model for noise-corrupted images. They found that noise significantly degrades the accuracy of saliency models. Judd *et al. * [26] elaborately investigated gaze over low-resolution images, and compared gaze dispersion on different image resolutions. Zhang et al. [27] investigated the optimal strategy to integrate attention cues into perceptual quality assessment, and showed that eye-tracking data on transformed images improves perceptual quality assessment methods.
These works, however, only considered certain types of transformations, limited amount of data, and a small set of saliency models. Further, they did not investigate the potential of various transformations for boosting saliency modeling (*e.g. *by serving as data augmentation). In this paper, we conduct a comprehensive study on the impacts of several transformations on both human gaze and saliency models. We also explore potential application and introduce a robust saliency model.
II The Proposed Eye-Movement Database
II-A Stimuli and transformation types
We selected 100 distortion-free reference images from the CAT2000 eye-movement database [28] since it covers various scenes such as indoor and outdoor scenes, natural and man-made scenes, synthetic patterns, fractals, and cartoon images. Considering that different reference images have different aspect ratios, we padded each image by adding two gray bands to the left and right sides and adjusted the image scale to make sure all images have the same resolution ().
To systematically assess the influence of ubiquitous transformations on human attention behavior, we choose 19 common transformations that could occur during the whole image acquisition, transmission, and displaying chain, including:
- •
Acquisition: 2 levels of motion blur and 2 levels of Gaussian noise,
- •
Transmission: 2 levels of JPEG compression,
- •
Displaying: 2 levels of contrast change, 2 rotation degrees, and 3 shearing transformations,
- •
Other: inversion, mirroring, line drawing (boundary maps), and 2 types of cropping distortions (to explore gaze variations under extremely abnormal conditions).
Eventually, we derive 18 transformed images for each reference image, and a total of 1900 images (18 100 + 100 reference images). Details of transformation types and generation code are shown in Table 2. Notably, these transformations are wildly used as data augmentation transformations for training deep neural networks to mitigate overfitting [29].
II-B Eye-tracking setup
As indicated by Bylinskii et al. [30], the eye-tracking experimental parameters (*e.g. *observers’ distance to screen, calibration error, image size) impact human gaze invariance. To mitigate these issues, we utilized the Tobii X120 eye tracker to record eye-movements. We used the LG 47LA6600 CA monitor with horizontal resolution of 1920 and vertical resolution of 1080, to match the resolutions of stimuli and the monitor screen. The height and width of the monitor were 60cm and 106cm, respectively. The distance between subject and the eye-tracker was 60cm. According to Bylinskii et al. [31], one degree of visual angle was used both as
- an estimate of the size of the human fovea, and
- to account for measurement error. In our experiment, the width of the screen subtended of visual angle, and of horizontal angle corresponding to 56.91 pixels ( and 56.55 pixels for the screen height, correspondingly).
Two types of ground-truth data have been traditionally used for training and measuring the accuracy of saliency models:
- binary fixation maps made up of discrete gaze points recorded by an eye-tracker, and
- continuous density maps representing the probability of the human gaze. The former can be converted into the latter by a Gaussian smoothing filter with standard deviation equal to one degree of visual angle [32], hence we chose in this paper.
We recruited 40 subjects to participate in the eye tracking experiment under the free-viewing condition. All participants had not been exposed to the stimuli set before. The duration time for each stimulus was . We inserted a gray image with duration between each two consecutive images to reset gaze to the image center for reducing the impact of memory effects [33] on gaze invariance. Besides, the presentation order of stimuli was randomized for each subject to mitigate the carryover effect from the previous images.
III Analysis of Human Gaze Invariance
In this section, we quantify the discrepancies between human gaze over transformed and reference images using Pearson’s Linear Correlation Coefficient (CC), Histogram Intersection Measure (SIM), and Kullback-Leibler divergence (KL) metrics [34]. The CC/SIM similarity matrices and KL dissimilarity matrix are shown in Fig. 2, where the transformation types are ranked by their similarity/dissimilarity values compared to the Reference images. Since Inversion, Mirroring, Rotation and Shearing transformations change the locations of pixels, we align gaze maps of these transformations with the Reference gaze map via the corresponding inverse transformations for fair comparison.
We first analyse human gaze invariance from a statistical perspective. As shown in Fig. 2, quantitative comparisons on CC, SIM and KL metrics indicate that most of the transformations impact human gaze, and the magnitude of impact highly depends on the transformation type. Besides, different magnitudes of the same transformation have similar impacts on human gaze, *e.g. *Noise1 vs Noise2, JPEG1 vs. JPEG2, and MotionBlur1 vs. MotionBlur2, and higher distortion magnitude causes severer impact. Third, we cannot directly use all of these transformations as data augmentation transformations for saliency prediction, because some transformations are not label-preserving in terms of human gaze.
Next, we provide a fine-grained analysis of human gaze under different transformations from a qualitative perspective.
Cropping: As shown in Fig. 3 and Fig. 4, Cropping transformation may delete some saliency information from the cropped side. For example, in the column of Fig. 4, human attention transfers from “station board” to “advertising boards”. Despite the critical semantic information (*i.e. *“station board”) being cropped, observers can still recognize the cropped image as a “railway station” via new salient objects (*i.e. *“advertising boards” and “metro”). Thus, we arrive at the following empirical inference. When a scene is cropped, human gaze tends to focus on salient regions with more semantic information that help understand the cropped scene.
Rotation, Shearing: Rotation and Shearing are spatial geometric transformations that alter original structural information and produce non-rigid objects. As we can see in Fig. 5 and Fig. 6, when viewing the rotated/affine-transformed stimuli, human gaze still focuses on semantic objects, but the intensities of the saliency regions are significantly changed by the geometric transformations. For example, in the first column of Fig. 6, when viewing Reference image, human gaze focuses on the “guide board” and “pedestrians”, and the “guide board” attracts more human attention than “pedestrians”. When viewing the affine-transformed image, although human fixations still locate at the “guide board” and “pedestrians” regions, the “pedestrians” attract more human attention. The similar cases can be observed in the and columns of Fig. 5, and the column of Fig. 6.
Noise and Compression: Noise and Compression are spatial perturbations that alter pixel intensities or texture, but maintain the structural information of the Reference image. Statistical comparison in Fig. 2 indicates that humans tolerate these spatial perturbations, demonstrating better invariance with regards to the Reference images.
Boundary: Boundary transformation maintains most of the structural information of the Reference images, but lacks the texture, color and luminance information. As shown in Fig. 7, we notice that the semantic objects still attract human gaze, *e.g. *“shoe” and “face” in the and columns. However, for the scenes without clear semantic information, *e.g. *the column, human gaze tends to focus on regions with sharp edges, thus causes discrepancy with the Reference image. Statistical comparison in Fig. 2 indicates that Boundary transformation has sever impact on human gaze invariance compared to the spatial perturbations such as Noise and Compression, but results in better invariance than geometric transformations. Thus, we arrive at another empirical inference: For upright and rigid scenes, low-level structural and texture information helps to detect high-level salient regions.
Mirroring, Inversion: Although Inversion is a special case of Rotation with rotation angle, it demonstrates better invariance with Reference than geometric transformations. This is because Mirroring and Inversion are symmetric versions of Reference images and maintain both structural and texture information. As shown in Fig. 8, although human fixations on Mirroring and Reference have slight discrepancy on the trivial salient regions, they are consistent on major salient objects with obvious semantic information, such as “face” and “pedestrians”.
Here, we list the lessons learned from our invariance analysis and the ways they can help saliency modeling as follows.
- •
Discriminative semantic objects: When a scene is cropped, human attention tends to focus on the salient regions with more semantic information that help to understand the cropped scene and to recover from the information loss.
- •
Highlighting semantic salient information while ignoring trivial artifacts: We verified that human gaze focuses on semantic objects over various transformations, besides, human gaze tolerates the trivial artifacts caused by transformations such as JPEG and Noise distortions. In order to reach human level accuracy on transformed scenes, the robust saliency models should emphasize semantic salient regions while mitigating trivial artifacts.
- •
Leveraging structural and texture information: For upright and rigid scenes, low-level structural and texture information helps to detect the salient regions.
- •
Combining multiple metrics: There is no “perfect” metric that can accurately quantify human gaze on various transformations. However, they can complement each other. 333Please see supplement for more details on the properties of different evaluation metrics on various transformations.
Finally, we briefly discuss the impact of human attention invariance to other vision tasks such as object detection and classification. As we know that, region proposal has been successfully adopted in object detection [35]. Saliency detection shares similar mechanism and goal with region proposal. Besides, in classification task, top-down attention mechanism encodes semantic discriminative regions to boost classification convolution network [36]. Different transformations will change the region proposal results at different levels. Wrong (or missing) region proposal will cause severe impact on final prediction of detection and classification applications. Thus, the lessons via human attention analysis are generalizable to a plethora of attention-based detection and classification applications. The robust approach should emphasize top-down semantic regions, and refine trivial bottom-up discriminative regions, in order to produce accurate region proposal.
IV Analysis of Data Augmentation
The most common data augmentation strategy is to enlarge the training set using some label-preserving transformations, such as Cropping, Inversion, ContrastChange, and Shearing. However, different from classical image classification and object detection problems, the common data augmentation methods may produce label noise for the saliency prediction problem. This is because different transformations will change the ground truth at different levels. This work carries important implications as to which of these types of transformations are valid and which ones provide approximations of human gaze. We divide common transformations included in the proposed dataset into two sets: valid and invalid augmented sets, and explore how fine-tuning on different sets of augmented data can improve or degrade the performance of deep models with respect to ground truth.
On the one hand, we select Reference, Mirroring, Inversion, Contrast1, Shearing1, JPEG1 and Noise1 to generate a valid augmented set, because these transformations have slight impacts on human gaze. On the other hand, Rotation1, Rotation2, Shearing2, Shearing3, Cropping1, Cropping2 and MotionBlur2 serve as an invalid set, because these transformations are not able to preserve human gaze labels as approximations of the Reference. We select 4 state-of-the-art deep saliency models, *i.e. *SAM-VGG [6], SAM-ResNet [6], ML-Net [2], and OpenSALCON [4], for a comprehensive investigation.
We design and perform two experiments in this section:
1. Which types of transformations can improve the model robustness on distorted images?
2. Do the valid augmentation transformations increase the model performance on normal distortion-free images?
In the first experiment, we select some distortion-free images from the CAT2000 dataset as a normal control group, because the proposed dataset has similar content with CAT2000, such as indoor, outdoor, fractals and cartoon images. Specifically, each of valid, invalid and normal control group is divided into a training set (550 images) and a test set (150 images), respectively. We borrow 100 images from CAT2000 as validation set for selecting optimal hyper-parameters.
In the first experiment, the model training process includes two steps, *i.e. *pre-training and fine-tuning. First, each model is pre-trained on SALICON dataset. This dataset contains 10,000 training images, 5,000 validation images and 5,000 test images. Next, we fine-tune the pre-trained models on 3 different datasets, *i.e. *valid transformed set, invalid transformed set, and distortion-free CAT2000 set, as shown in the and rows of Fig. 9.
In the second experiment, we select 1500 distortion-free images from CAT2000 as original training set, 400 images as test set, and 100 images as validation set. Then, we use the valid transformations to enlarge the original training set of CAT2000 to 10500 images. Similarly, the deep models are first pre-trained on SALICON training set. We then fine-tune the pre-trained models using the augmented CAT2000 training set (10500 images) and the original CAT2000 training set w/o augmented data (1500 images), respectively. Performance comparisons of these two fine-tuning strategies are shown in the row of Fig. 9.
For fair comparison, we unify the experimental setup for different data augmentation strategies. In the pre-training stage, we set the training hyper-parameters as follows: 1) For the 4 deep models mentioned in Fig. 9, stochastic gradient descent (SGD) serves as the optimization function with momentum of 0.9, weight decay of 0.0005, and the batch size of 1, and 20 training epochs, 2) For the ML-Net, learning rate is , 3) For OpenSALICON, learning rate is , and 4) For SAM-VGG and SAM-ResNet, initial learning rates are set to , and are decreased by 10 every two epochs for SAM-ResNet, and every three epochs for SAM-VGG. In the fine-tuning stage: 1) We also adopt SGD with momentum of 0.9 and weight decay of 0.0005, and set batch size to 1, fine-tuning epoch to 10, 2) For ML-Net, learning rate is , 3) For OpenSALICON, learning rate is , and 4) For SAM-VGG and SAM-ResNet, initial learning rates are , and are decreased by 10 every two epochs for SAM-ResNet, and every three epochs for SAM-VGG.
Experimental results shown in the row of Fig. 9 verify that fine-tuning using the valid transformed set can improve deep models’ robustness on the distorted test set, compared to using CAT2000 which contains only distortion-free images. However, as shown in row of Fig. 9, fine-tuning using the invalid transformed set degrades deep models’ performances compared to using normal stimuli. The results of the row of Fig. 9 indicate that the valid transformations provide an efficient data-augmentation approach to utilize expensive eye-movement data for boosting deep saliency models.
V The Proposed GazeGAN Model
We recall the lessons learned from human gaze analyses (*i.e. *Section-III), and list the general ideas behind the proposed model as follows:
- •
Conditional GAN (for discriminating semantic object): The generator aims to fool the discriminator that is trained to distinguish synthetic saliency maps from real human gaze. The discriminator conditioned by the transformed images can boost generator to focus on semantic salient objects as real human gaze;
- •
Center-surround connection (for highlighting semantic information, while mitigating trivial artifacts): Inspired by human visual center-surround antagonism mechanism, we propose a novel cross-scale short connection module, which helps model output to mitigate wrong predictions caused by trivial artifacts, while concentrating on semantic salient objects, in order to reach human level accuracy on transformed scenes;
- •
Skip-connections (for leveraging structural and texture information): Skip-connections combine low-level structural and texture features from encoder layers with high-level semantic features from decoder layers, because the low-level features also help to detect salient regions;
- •
Local-global GANs (more robust to scale transformation): Multiple generators learn different groups of spatial representations in different scales, while multiple discriminators can improve the intermediate prediction results from coarse to fine.
V-A The generator
As shown in Fig. 10, the backbone GazeGAN generator is a modified U-Net equipped with a novel “center-surround connection module” (CSC module).
U-Net is a powerful fully convolutional network presented by Olaf et al. [37]. It has made a great breakthrough in biomedical image segmentation by predicting each pixel’s class. In saliency prediction, the goal of U-Net is predicting each pixel’s probability of being salient. Compared to the generator of SalGAN [3] saliency model (*i.e. *VGG-16), U-Net consists of symmetric encoder and decoder layers, and utilizes skip connections to combine low-level structural and texture features from encoder layers with high-level semantic features from decoder layers.
An important early vision mechanism in the human vision system that serves recognition and attention is the “center-surround” mechanism. The early visual neurons (retina and LGN) are most sensitive in a small region of the visual space (*i.e. *center of receptive field), while stimuli presented in the antagonistic region concentric to the center (the surround) inhibit the neuronal response [9]. The “Center-surround” mechanism highlights local spatial discontinuities and is well-suited for detecting salient locations that stand out from their surround while suppressing other trivial information, such as noise and artifacts.
For improving the robustness of deep saliency models, we add the “center-surround” mechanism into the CNN model for the first time. Here, we implement the “center-surround” operation as a cross-scale short connection module, because it is generalizable to any encoder-decoder CNN architecture, as shown in Fig. 10. Specifically, we select the feature maps in a coarse scale (the surround) from the decoder layer, where , and the corresponding fine scale maps (the center) are from the decoder layer, where . We first use a transposed convolution layer to upsample the surround feature maps to have the same resolution (height width) with the center maps. Besides, in the upsampling stage, this transposed convolution also learns to reduce the wrong predictions caused by trivial artifacts. Next, we employ the 11 convolution layers to unify the channels of center and surround maps while keeping the resolution fixed. Then, we compute the preliminary center-surround output by an element-wise summation as:
[TABLE]
where and represent the surround feature maps of the layer, and the center feature maps of the layer, respectively. represents the convolution, and represents the transposed convolution. represents the preliminary center-surround response of and . is an element-wise summation. is the convolution operation.
Next, we utilize the attention mechanism to further highlight the semantic saliency regions detected by . Specifically, we feed the into a convolution to squeeze the channel amount as 1, thus obtain a 2D one-channel map . Then, we compute the 2D normalized attention map of via softmax function:
[TABLE]
The final output of the CSC module is computed by the element-wise product of and normalized attention map as , where is the expanded 3D attention map via duplicating the 2D map in channel direction.
Finally, we concatenate the obtained center-surround maps with the other feature maps from the decoder layer in channel direction. This way, each of the decoding layers concatenate 3 types of feature maps, *i.e. *. Specifically, is low-level structural/texture features from the encoder layer via skip-connection, is semantic features from the decoder layer, and is center-surround features via CSC modules that mitigate trivial artifacts. In Fig. 10, we use red, yellow, and blue rectangles to represent , respectively. Notice that all activation functions in encoder layers are leaky-ReLUs with slope 0.2, while activation functions in decoder layers are normal ReLUs.
In the final architecture shown in Fig. 10, we further append a local generator on basis of the global generator , in order to extract more high-resolution features. is able to detect the fine-scale semantic salient objects, while encodes more salient objects in coarse scales, *i.e. *small face and tiny text. Specifically, we concatenate the feature maps from the last decoder layer of with the feature maps from the second encoder layer of , to integrate the global semantic information from coarse to fine. We feed the original image into the , and feed the downsampled image into the . The and are jointly trained end to end.
V-B The discriminator
To discriminate real human gaze from synthetic saliency map, we train a 5-layer patch-based discriminator [38], which contains 4 convolution layers with increasing number of convolution kernels, increasing by a factor of 2 from 64 to 512 kernels. On top of the 512 feature maps generated by the discriminator layer4, we append a sigmoid layer with filter kernels and sigmoid activation function to obtain the final probability of being the real human gaze. Notice that we concatenate the saliency map (or human gaze) with original input color image in channel direction, and feed them to the discriminator simultaneously. Thus, GazeGAN is a conditional GAN [38] because both the generator and the discriminator can observe the input source image, as shown in Fig. 11. Particularly, the conditional discriminator has access to both input images (including transformation type) and the corresponding saliency maps, demonstrating better discrimination ability on semantic objects than the normal discriminator. We append the conditional discriminators to the end of and , respectively, in order to improve the predictions from coarse to fine.
V-C Loss functions
In the human gaze analysis section, we found that there is no “perfect” evaluation metric that can accurately quantify human gaze on various transformations. However, metrics can compensate for each other. Previous works [38, 39, 40] have proved it beneficial to mix the adversarial loss with some task-specific content losses to train a GAN.
V-C1 The content loss
For saliency detection task, it has been proved that a linear combination of different saliency evaluation metrics achieves a good performance [4, 6].
CC, KL and NSS [41] metrics 444See supplement for more details about KL, CC, and NSS losses. perform well in measuring the pixel-level similarity between ground-truth and synthetic maps. However, we found that only using a linear combination of pixel-level losses produce high discrepancy between the grey-level histograms of synthetic result and human gaze. 555Please see supplement for visualization of this issue.
For solving the drawbacks of pixel-level losses, we propose a histogram loss to reduce the histogram discrepancy between the generated saliency map and the human gaze map. The histogram loss includes two steps, i.e. histogram distribution estimation and histogram similarity calculation. For constructing a differentiable histogram loss, we first devise the histogram estimation method based on Ustinova’s work [42]. We denote the pixel luminance of saliency map as , , where represents the number of pixels in the saliency map. Suppose that the distribution of is estimated as the -dimensional histogram with the nodes = 0, = , …, = 255 uniformly filling with the step . Then, we use equation 3 to estimate the probability distribution (denoted as , where ) for each node of the histogram.
[TABLE]
We then adopt the min-max normalization method to normalize as , to guarantee that . Next, we utilize the Alternative Chi-Square (ACS) distance to measure the histogram similarity.666Derivative of proposed loss is provided in Supplementary Material
[TABLE]
where and represent the normalized probability distribution at the node of histograms of generated saliency map and ground-truth human gaze, respectively. is a smoothing term to avoid division by zero. We set to 255.
As shown in equation 5, the final content loss is a linear combination of four pixel-level losses , KL, CC and NSS, and a histogram loss . In Section-, we quantify the contribution of each loss function via ablation study.
[TABLE]
where , are five scalars to balance five losses, and the good default settings are 1, 10, -2, -2 and 1, respectively. The good default scalars are tested and selected via SALICON validation set. The smaller values for , KL and scores indicate higher similarity between synthetic result and ground-truth, whereas for CC and NSS, the higher values indicate higher similarity.
V-C2 The adversarial loss
The adversarial loss is expressed as
[TABLE]
where means the original input image, while G and D represent generator and discriminator. G represents the global and local generators (*i.e. * and ), while D represents the fine-scale and coarse-scale discriminators. G tries to minimize this adversarial loss against an adversarial D which tries to maximize it, i.e. .
VI Experiments and Results
VI-A Experimental setup
We use 4 datasets to ensure a comprehensive comparison including: 1) SALICON dataset (previously released) [43]; 2) LSUN’17 dataset (SALICON-2017-released-version) [44]; 3) MIT1003 dataset [45] and 4) The proposed dataset.
For SALICON, MIT1003, and LSUN’17 datasets, we resize input images to for saving computing resources. Considering that the images of MIT1003 have different resolutions, we apply zero padding bringing images to have a unified aspect ratio of 4:3 and resize them to have the same size. Images of the proposed dataset have the same input size of , hence we resize them to .
For fair comparison, all of the deep-learning based models are trained from scratch on the SALICON (previously released) dataset. Specifically, we first adopt the proposed valid data augmentation transformations to enlarge the 10,000 training images. This way, we obtain another 60,000 augmented stimulus set with 6 types of label-preserving transformations. For SALICON [4], SAM-VGG [6], SAM-ResNet [6] and SalGAN [3] models, we follow their authors’ guideline to initialize their network parameters using the pre-trained weights on ImageNet [46]. The proposed GazeGAN is initialized from a Gaussian distribution with mean 0 and standard deviation 0.02, which achieves similar performance with the ImageNet initialization method. We use the augmented 70,000 training samples to train all of the competing models. We select 4,000 images from the SALICON validation set as the test set and the remaining 1,000 images serve as the validation set for selecting the optimal hyper-parameters.
For MIT1003 dataset, we randomly divided it into a training set with 600 images, a validation set with 100 images, and a test set with 303 images. We use the same data augmentation method to enlarge the training set of MIT1003 dataset. For all competing models, we reload the parameters pre-trained on the augmented SALICON training set. We then fine-tune the models on the augmented MIT1003 training set.
The proposed dataset consists of 19 transformation groups, and each group contains 100 images. We divide each group into 60 training images, 10 validation images and 30 test images. This way, we obtain 1140 training samples, 190 validation samples and 570 test samples. Similarly, for all competing models, we reload the parameters pre-trained on the augmented SALICON training set, then we fine-tune the models on 1140 training samples of proposed dataset.
For LSUN’17 dataset, the performance scores of other competing models are from LSUN’17 SALICON Saliency Prediction Competition system [44], where our model is under the username “codacscgaze”.
In the training stage, we encourage the generator of the proposed GazeGAN to minimize the linear combination of the content loss and the adversarial loss . Besides, rather than training the discriminator to maximize , we instead minimize -. Adam optimizer [47] with a fixed learning rate , and the momentum parameter of serves as the optimization method to update the model parameters. We alternatively update the generators and discriminators as suggested by Goodfellow et al. [48]. The batch-size is set as 1. Our implementation is based on Pytorch and Tensorflow flowcharts, using NVIDIA Tesla GPU.
VI-B Ablation analysis
In this section, we evaluate the contribution of each component of the proposed model. We first compare the performance of GazeGAN when using different losses, as shown in Table II. We find that the combination of pixel-level losses, histogram loss, and adversarial loss achieves superior performance over different evaluation metrics.
Next, we focus on the contributions of different modules of our model. For this purpose, we construct four different variations: : the plain U-Net, : the plain U-Net integrated with four residual blocks, : the modified equipped with the CSC module, and is constructed by appending the local generator to . Table III shows the ablation analysis results on LSUN’17 validation set. We can see that every module contributes to the final performance. We provide more ablation study results on SALICON (previously released), MIT1003, and the proposed dataset in the supplementary material.
VI-C Comparison with the state-of-the-art
We first quantitatively compare GazeGAN with state-of-the-art models on SALICON (old version), MIT1003, LSUN’17 (SALICON-2017-version), and the proposed dataset. Experimental results are reported in Tables IV-VII. GazeGAN achieves top-ranked performance on the SALICON (old version) validation set and proposed dataset over different evaluation metrics. It also obtains competitive performance on the MIT1003 and LSUN’17 datasets.
The qualitative results are shown in Figs. 12-14. We notice that, GazeGAN generates accurate results for various transformed scenes, as in Fig. 12. Besides, on normal stimuli in Fig. 13 and Fig. 14, GazeGAN performs well, even for challenging scenes containing multiple faces, gazed-upon objects and text, as in Fig. 14.
VI-D Finer-grained comparison on transformed dataset
As shown in Fig. 15, we further provide the fine-grained comparison of 22 existing saliency models on each transformation type of the proposed dataset.
For comprehensive comparison, we select 15 early saliency models based on hand-crafted features, IttiKoch [9], GBVS [10], Torralba [11], CovSal [12] (CovSal-1 utilizes covariance feature and CovSal-2 utilizes both of covariance and mean features), AIM [13], Hou [14] (Hou-Lab and Hou-RGB adopt Lab and RGB color spaces respectively), LS [15], LGS [15], BMS [16], RC [17], Murray [18], AWS [19] and ContextAware [20]. We also select 7 deep saliency models, GazeGAN, ML-Net [2], SalGAN [3], OpenSALICON [4], Sal-Net [5], SAM-ResNet [6] and SAM-VGG [6].
We observe the following points from Fig. 15:777We provide more results under CC and KL metrics in the supplement.
- •
: Rotation2, Shearing3, Noise2 and Contrast2 are the most challenging transformations for saliency models. Most saliency models underperform on these transformations. Rotation2 and Shearing3 impose severe geometrical transformations, while Noise2 and Contrast2 include high level spatial perturbations. The former changes the spatial structure of image, while the latter alters intensities and local contrast. Recall that Rotation2 and Shearing3 also have severe impacts on human gaze.
- •
: LS and LGS fail on Boundary. Sal-Net fails on Contrast2. ML-Net and OpenSALICON fail on Noise2 and Contrast2. CovSal-1 and CovSal-2 fail on sAUC metric, especially on Rotation and Boundary, because the CovSal model overemphasizes center-bias which is penalized by the sAUC metric.
- •
: Deep saliency models obtain higher performances compared to the early models based on hand-crafted features. GazeGAN achieves top-ranked average performance over different metrics. Besides, GazeGAN is robust to various types of transformation without obvious failures.
VI-E Discussion on the robustness of GazeGAN
As indicated in Fig. 12, Fig. 15 and Table VI, the proposed GazeGAN achieves better robustness against various transformations. In this section, we discuss the robustness of the proposed model from different perspectives.
- •
: Our proposed CSC module has two advantages. It mitigates the trivial artifacts, and highlights semantic salient information, as shown in Fig. 16. For example, in the column of Fig. 16, we notice that the compression artifacts cause wrong predictions in the surround feature maps, and we want to mitigate the impacts of these trivial artifacts. Besides, despite the surround feature maps can detect semantic salient regions (*e.g. *“pedestrians”), the responses of semantic salient regions are not strong enough. Thus we want to further emphasize the responses of these semantic salient regions. We can see that the final output processed by CSC module concentrates on semantic salient regions, while ignoring the trivial artifacts.
- •
: Second, CSC module improves the nonlinearity of the proposed deep model. Specifically, each individual CSC module contains three convolution layers and one transposed convolution layer. We append a nonlinear ReLU activation after each convolution layer. Besides, we utilize eight CSC modules in the proposed GazeGAN architecture in total, that are nonlinear activations. According to [50, 51], the higher model nonlinearity increases the representational ability of deep neural network, demonstrating better robustness against transformations.
- •
: Hendrycks *et al. * [52] pointed that multiscale architectures achieve better robustness by propagating features across different scales at each layer rather than slowly gaining a global representation of the input as in traditional CNNs. GazeGAN utilizes both skip-connections, CSC connections, and local-gloabl GAN architectures. Both of these factors adequately leverage multiscale features.
- •
: Hybrid adversarial training is a defense strategy for improving robustness of deep CNN models against adversarial attacks [51]. This method utilizes an ensemble of original images and the adversarial examples to train the deep models. Adversarial examples are the manually generated images by adding some slight perturbations to original images [51]. In fact, the proposed valid data augmentation strategy provides a similar solution, which is adopting the examples corrupted by an ensemble of several transformations to train the deep CNNs. This hybrid adversarial training strategy is currently the most effective method to improve model robustness, and prevents overfitting to a specific transformation type [51].
VII Conclusion
In this article, we introduce a new eye-tracking dataset containing several common image transformations. Based on our analyses of eye-movement data, we propose a valid data augmentation strategy using some label-preserving transformations for boosting deep-learning based saliency models. Besides, we propose a new model called GazeGAN integrated with a novel center-surround connection module that mitigates trivial artifacts while emphasizing semantic salient regions, demonstrating better robustness against various transformations. GazeGAN achieves the best results on the transformed dataset, and obtains competitive performance on normal distortion-free benchmark datasets. We share our dataset and code with the community at https://github.com/CZHQuality/Sal-CFS-GAN, where we provide both Pytorch and Tensorflow versions of the code. Our repository provides a flexible interface for users to integrate their own architectures and to promote research on improving the robustness of saliency models over non-canonical stimuli.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] A. Borji and L. Itti. State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence , Vol. 35, No. 1, pp. 185-207, 2013.
- 2[2] M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara. A deep multi-level network for saliency prediction. In Proceedings of IEEE International Conference on Pattern Recognition , pp. 3488-3493, 2016.
- 3[3] J. Pan, C. Canton, K. Mc Guinness, and et al. Salgan: Visual saliency prediction with generative adversarial networks. In ar Xiv preprint ar Xiv:1701.01081 , 2017.
- 4[4] Thomas and Christopher. Opensalicon: An open source implementation of the salicon saliency model. In ar Xiv preprint ar Xiv:1606.00110 , 2016.
- 5[5] J. Pan, K. Mc Guiness, E. Sayrol, N. Conner, and et al. Shallow and deep convolutional networks for saliency prediction. In Proceedings of IEEE International Conference on Computer Vision and Pattern Recognization , pp. 598-606, 2016.
- 6[6] M. Cornia, L. Baraldi, G. Serra, and et al. Predicting human eye fixations via an lstm-based saliency attentive model. IEEE Transactions on Image Processing , Vol. 27, No. 10, pp. 5142-5154, 2018.
- 7[7] A. Borji, D. N. Sihite, and L. Itti. Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study. IEEE Transactions on Image Processing , Vol. 22, No. 1, pp. 55-69, 2013.
- 8[8] A. Borji, H. R. Tavakoli, D. N. Sihite, and L. Itti. Analysis of scores, datasets, and models in visual saliency prediction. In Proceedings of IEEE International Conference on Computer Vision , pp. 921-928, 2013.
