Image Aesthetics Assessment Using Composite Features from Off-the-Shelf Deep Models
Xin Fu, Jia Yan, Cien Fan

TL;DR
This paper introduces a training-free approach for image aesthetics assessment that leverages composite features from pretrained deep models, outperforming existing methods without requiring model fine-tuning.
Contribution
The method utilizes off-the-shelf deep features from global, local, and scene-aware information, demonstrating superior performance over state-of-the-art techniques.
Findings
Deep residual networks produce more aesthetics-aware features.
Composite features improve overall assessment accuracy.
The approach outperforms existing methods on benchmark datasets.
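Table 1: Datasets used in the experiments (number of images).

| Dataset | Subset | High | Low | Train | Test |
| --- | --- | --- | --- | --- | --- |
| AVA | AVA1 | 74,673 | 180,856 | 235,599 | 19,930 |
| AVA | AVA2 | 25,553 | 25,553 | 25,553 | 25,553 |
| CUHKPQ | | 10,524 | 19,166 | 14,845 | 14,845 |

Table 2: Classification accuracy (%) with global-view features from different architectures.

| Dataset | AlexNet | VGG-16 | ResNet-50 |
| --- | --- | --- | --- |
| AVA2 | 53.2 | 82.1 | 87.7 |
| CUHKPQ | 73.5 | 87.1 | 90.3 |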
Abstract
Deep convolutional neural networks have recently achieved great success on the image aesthetics assessment task. In this paper, we propose an efficient method that takes the global, local and scene-aware information of images into consideration: composite features are extracted from the corresponding pretrained deep learning models and classified with a support vector machine. Contrary to popular methods that require fine-tuning or training a new model from scratch, our training-free method directly takes the deep features generated by off-the-shelf models for image classification and scene recognition. We also analyze the factors that could influence performance from two aspects: the architecture of the deep neural network and the contribution of local and scene-aware information. It turns out that deep residual networks produce more aesthetics-aware image representations and that composite features improve overall performance. Experiments on common large-scale aesthetics assessment benchmarks demonstrate that our method outperforms the state-of-the-art results in photo aesthetics assessment.
**Index Terms:** Image Aesthetics, Deep Learning, Feature Extraction, Pretrained Models
1 Introduction
The wide availability of photographic devices such as digital cameras and smartphones allows individuals to take photos more conveniently than ever. At the same time, considerable effort and time are needed to sift through the piles of images stored on devices and in cloud storage. Automatically picking out aesthetically pleasing images is therefore very useful. Figure 1 shows some examples of images with good and bad aesthetics.
In recent years, the research community has addressed this challenging problem by developing ways to classify images into the binary categories of high quality and low quality [1]. Early work mainly focused on various hand-crafted aesthetic features and on feature representations such as SIFT or color descriptors [2]. With the evolution of deep learning, deep neural networks (DNNs), especially convolutional neural networks (CNNs), have been used successfully in various fields, such as image classification [3, 4, 5], object detection and scene recognition [6]. The state-of-the-art performance achieved by deep neural networks demonstrates their powerful feature representation ability for various visual tasks. Consequently, deep learning-based techniques have been adopted for aesthetics assessment over the past few years and have achieved better performance than conventional approaches [1, 2, 7].
Very deep convolutional networks have been central to the largest advances in image recognition performance in recent years. Existing popular deep neural networks are carefully designed for visual tasks and have been trained on large-scale datasets comprising millions of images, such as ImageNet [8], so they are able to extract generic image representations that can be applied to other similar visual tasks [9, 10]. The generic features they produce also carry information about image aesthetics, so a more advanced model should yield better results. However, shallow neural networks are still widely adopted in many deep learning-based techniques for aesthetics assessment. Furthermore, nearly all of these methods require fine-tuning or training a model from scratch, a process that typically consumes days or weeks [11]. Our work tries to overcome these disadvantages by making the most of existing trained deep learning models.
In this paper, we propose a more efficient method that uses off-the-shelf deep neural networks for both image classification and scene recognition to extract deep representations of images from three perspectives. By bringing the global, local and scene-aware representations together into composite features for classification, we are able to improve the results of aesthetics assessment. Compared to other state-of-the-art approaches that need to re-train or fine-tune a DNN model, our method, which requires no neural network training, achieves better performance.
2 Related Work
In this section, we will give a brief overview of recent works from two different aspects.
2.1 Image Aesthetics Assessment
Methods in image aesthetics assessment could generally be divided into three distinct categories: classical handcrafted low-level features, generic features based on image descriptors, and the contemporary approach of utilizing deep learning models. Datta et al. [12] proposed visual features based on standard photography and visual design rules to encapsulate aesthetic attributes from low-level image features. Marchesotti et al. [13] proposed to learn aesthetic attributes from textual comments on the photographs using generic image features.
Recently, deep learning methods have been applied to image aesthetics assessment [11, 7, 14] and have significantly improved prediction accuracy over previous non-deep methods. Tian et al. [15] proposed a query-dependent aesthetic model based on feature representations learned by a CNN. Dong et al. [16] proposed to adopt the generic features from the penultimate layer output of AlexNet with spatial pyramid pooling. Wang et al. [17] proposed a CNN modified from AlexNet by stacking seven scene convolutional layers. Jin et al. [11] proposed ILGNet, derived from part of GoogLeNet, which contains Inception modules.
2.2 Deep Neural Networks for Computer Vision Tasks
**Evolution of Architectures:** A variety of DNN models have been developed in recent years and have achieved great success in different computer vision tasks. AlexNet [3] was the first CNN to win the ImageNet Challenge, in 2012, and consists of five convolutional (CONV) layers and three fully-connected (FC) layers. VGG-16 [4] goes deeper, to 16 layers: 13 CONV layers and 3 FC layers. ResNet [5] uses residual connections to go even deeper (34 layers or more). It was the first DNN entry in the ImageNet Challenge to exceed human-level accuracy, with a top-5 error rate below 5%. Previous work [18, 9] shows that, in general, the better a DNN performs on the ImageNet classification task, the more effective the deep features it extracts.
**DNN for Scene Recognition:** Scene recognition is a challenging problem, since a scene conveys not only object-level visual information but also the relationships between objects. Deep convolutional neural networks trained on the Places database (Places-CNNs) have shown impressive results in scene recognition tasks [6, 10] and have been applied in many areas. The content and scenery of an image are fundamental to its aesthetics, yet they are sometimes overlooked in the assessment. It has also been shown that taking image content into account can improve the accuracy of image aesthetics prediction [14, 17].
3 Method
In this section, we give a detailed description of our method. As shown in Figure 2, we use three parallel deep neural networks, each of which extracts specific deep features from the input image. The extracted features are then aggregated and passed to a classifier.
3.1 Off-the-shelf CNN Features
Deep convolutional neural networks trained on a 1.2-million-image subset of the ImageNet dataset can be employed as general feature extractors. Following previous work [9, 10], we take the trained network weights directly from their original published work, with no modification.
DNNs like AlexNet and VGG-16 consist of a stack of CONV layers followed by FC layers. For these networks, we directly take the 4096-dimensional activations of the first FC layer as the features used later for classification. For ResNet, we use the features from its penultimate layer, i.e., the average pooling layer (AvgPool), which is typically 2048-dimensional (as it is for ResNet-50).
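To make the AvgPool feature concrete, the following is a minimal sketch of global average pooling on a toy feature map, written in plain Python. It is not the extraction code used in the paper. The helper name `global_average_pool` and the toy values are illustrative assumptions; in ResNet-50 the same operation reduces a 2048-channel map to one number per channel, which gives the 2048-dimensional vector mentioned above.

```python
def global_average_pool(feature_map):
    """Average each channel's H x W grid into a single number.

    `feature_map` is a nested list shaped [channels][height][width].
    The result has one value per channel.
    """
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


# Toy example (hypothetical values): 3 channels, each a 2 x 2 grid.
toy_map = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
    [[5.0, 5.0], [5.0, 5.0]],
]
toy_vector = global_average_pool(toy_map)
print(toy_vector)
```

The output length equals the number of channels, regardless of the spatial size of the map.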
Recent work [18] studies the effectiveness of ImageNet features and concludes that ResNet models are better feature extractors. We ran similar experiments and obtained broadly consistent results. Thus, in our work, we choose ResNet-50 as the extractor and show that it is a better model for the aesthetics assessment task by comparing it with AlexNet and VGG-16.
3.2 Using Composite Features From Different Nets
The advantage of our method is that we exploit three parallel deep neural networks to extract unique features from three different aspects, including the global view, local view and scene-aware information. We refer to the final aggregated features as composite features that will be used by the following classifier.
**Global View:** In order to extract effective features representing the picture as a whole, a column of DNN for image classification is used as the global view feature extractor. By resizing the given image to a fixed size and feeding it to the network, the global view features of this image can be acquired.
**Local View:** The local view of an image is closely related to its aesthetic evaluation and is often overlooked. Instead of randomly sampling parts of the original high-resolution image, we crop the center area of the image by a fixed ratio (0.62) and use it as the local view, which is more visually representative since people pay more attention to the center. We feed the cropped part to the same deep neural network used for the global view and obtain the required feature vector.
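The center crop can be sketched as below. This is a minimal illustration, not the authors' code: the function name `center_crop_box` is hypothetical, and it assumes that the ratio 0.62 scales each side length (the text does not say whether the ratio applies to side length or to area).

```python
def center_crop_box(width, height, ratio=0.62):
    """Return (left, top, right, bottom) of a centered crop.

    Assumption: `ratio` scales each side length, not the area.
    """
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    crop_w = round(width * ratio)
    crop_h = round(height * ratio)
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h


# Example: a 1000 x 800 image keeps its central 620 x 496 region.
box = center_crop_box(1000, 800)
print(box)  # (190, 152, 810, 648)
```

The returned tuple follows the (left, upper, right, lower) convention, so it could be passed to an image library's crop routine, for example Pillow's `Image.crop`.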
**Scene-aware Information:** Besides the global and local views, the content of an image has much to do with its overall aesthetics. Zhou et al. [6] published the Places Database, comprising 10 million scene photographs labeled with 434 scene semantic categories. Among the DNNs they trained, Places365-ResNet reaches 85.07% top-5 accuracy, the highest of all. We therefore use the scene-aware features extracted by the Places365 model based on ResNet-50 as additional information to improve the performance of our method.
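One simple way to build the composite features is to concatenate the three per-image vectors. The sketch below assumes plain concatenation, which the text does not state explicitly, and assumes that each of the three ResNet-50-based extractors yields a 2048-dimensional AvgPool vector. The function name and the placeholder values are illustrative only.

```python
FEATURE_DIM = 2048  # AvgPool width of ResNet-50 (Section 3.1)


def composite_features(global_feat, local_feat, scene_feat):
    """Concatenate global, local and scene-aware vectors.

    Assumption: aggregation is plain concatenation, in this order.
    """
    return list(global_feat) + list(local_feat) + list(scene_feat)


# Placeholder vectors standing in for real network activations.
global_feat = [0.1] * FEATURE_DIM
local_feat = [0.2] * FEATURE_DIM
scene_feat = [0.3] * FEATURE_DIM

composite = composite_features(global_feat, local_feat, scene_feat)
print(len(composite))  # 6144
```

Under these assumptions each image is described by a 6144-dimensional vector that is handed to the classifier.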
4 Experiments
We conduct experiments to evaluate the performance of different deep neural network architectures. We also analyze the contributions of the local view and of scene-aware information.
4.1 Datasets
We use the AVA [2] and CUHKPQ [19] datasets in our experiments. We build the subset AVA1 following [2, 11, 14, 17] and the subset AVA2 following [16, 17, 20]. CUHKPQ is set up as in [20]. Details are presented in Table 1.
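As a quick sanity check on the counts reported in Table 1, the sketch below verifies that the high and low counts of each dataset add up to the same total as its train and test split. The dictionary layout and variable names are illustrative; the numbers are copied from Table 1.

```python
# Image counts copied from Table 1.
splits = {
    "AVA1": {"high": 74673, "low": 180856, "train": 235599, "test": 19930},
    "AVA2": {"high": 25553, "low": 25553, "train": 25553, "test": 25553},
    "CUHKPQ": {"high": 10524, "low": 19166, "train": 14845, "test": 14845},
}

totals = {}
consistent = {}
for name, s in splits.items():
    totals[name] = s["high"] + s["low"]
    # The label counts and the split sizes should describe the same images.
    consistent[name] = totals[name] == s["train"] + s["test"]

for name in splits:
    print(name, totals[name], consistent[name])
```

The check also makes the class balance visible: AVA2 is balanced between high and low, whereas AVA1 contains far more low-labeled than high-labeled images.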
4.2 Evaluating the Impact of Network Architectures
We extract features from three different DNNs as described in Section 3.1. The extracted features can be separated into categories by traditional machine learning classifiers such as Support Vector Machine (SVM), Random Forest and AdaBoost, and decent results can be achieved with these simple classifiers. Here, we adopt an SVM with an RBF kernel as our classifier. It is worth noting that we deliberately do not tune the classifier's parameters, so that the results reflect the quality of the features themselves.
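For reference, the RBF kernel used by such an SVM measures the similarity of two feature vectors as exp(-gamma * ||x - y||^2). The sketch below implements this standard definition in plain Python. The value of `gamma` is an assumption chosen for illustration, since the kernel parameters used in the experiments are not reported, and the function is not a full SVM.

```python
import math


def rbf_kernel(x, y, gamma):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Identical vectors have similarity 1; similarity decays with distance.
same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)
apart = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)
print(same, apart)
```

In practice an off-the-shelf implementation would be used. In scikit-learn, for instance, `SVC` uses an RBF kernel by default, which matches the idea of leaving the classifier untuned, although the text does not say which library was used.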
To find out which off-the-shelf deep neural network trained on ImageNet generates the most effective features for the aesthetics task, we carry out a basic experiment on the AVA2 subset: pretrained networks extract deep representations, which are then classified with a simple SVM.
As Table 2 shows, the deep features obtained from the different architectures result in different accuracies. AlexNet performs the worst of the three, VGG-16 follows with better results, and ResNet-50 achieves the highest accuracy on both datasets.
ResNet models thus yield more generic features that lead to better accuracy in image aesthetics assessment, which is consistent with the conclusion of [18]; the experiments in Figure 3 further support this.
4.3 Evaluating the Benefits of Local and Scene-aware Information
The deep representation extracted by a single pretrained model is not sufficient on its own to obtain the best results. Here, we demonstrate the contributions of the local and scene-aware information of the input image. The local view features are generated by the same deep neural network used for the global view. To verify the effectiveness of scene-aware information, we use the features generated by the Places365-ResNet from [6], which is based on ResNet-50 and fine-tuned for scene classification.
The benefits of local and scene-aware information are presented in Figure 3. On the AVA2 subset, with the global view only, the accuracy of classifying features generated by VGG-16 is 82.1% and by ResNet-50 is 87.7%. With both the global view and scene-aware information, the accuracy increases by 1.7% and 0.8% respectively. With the global and local views, it rises by 1.9% and 2.2%. Finally, when we use the composite features made up of global, local and scene-aware information, the accuracy goes higher still and reaches 85.4% and 90.0% on the AVA2 subset.
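The resulting accuracies can be worked out from the figures quoted above. The sketch below assumes the reported increases are absolute percentage points added to the global-view baseline, which the text does not state explicitly. The dictionary names are illustrative.

```python
# AVA2 accuracies (%) and gains quoted in the text.
baseline = {"VGG-16": 82.1, "ResNet-50": 87.7}
gain_scene = {"VGG-16": 1.7, "ResNet-50": 0.8}
gain_local = {"VGG-16": 1.9, "ResNet-50": 2.2}
composite = {"VGG-16": 85.4, "ResNet-50": 90.0}

summary = {}
for model, base in baseline.items():
    summary[model] = {
        # Assumption: gains are absolute percentage points.
        "global+scene": round(base + gain_scene[model], 1),
        "global+local": round(base + gain_local[model], 1),
        "composite_gain": round(composite[model] - base, 1),
    }

for model, row in summary.items():
    print(model, row)
```

Under this reading, the composite features add 3.3 points for VGG-16 and 2.3 points for ResNet-50 over the global view alone.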
Further, we conduct a similar experiment on a smaller dataset, CUHKPQ, and find the same tendency, as shown in Figure 3 (b). With the composite features from ResNet-50, our method achieves 94.1%, which exceeds the state-of-the-art result in [20].
According to Figure 3, ResNet-50 performs better, so we run the experiments on the AVA1 subset using ResNet-50. The increase in accuracy is shown in Figure 4, although the overall accuracy is limited by the untuned SVM and by the large number of pictures scoring around 5, which are hard to separate. Again, the result shows the benefit of using composite features.
We also perform a cross-dataset evaluation. We train an SVM classifier on the global view features extracted from the whole AVA2 dataset and use this model to classify the features from the entire CUHKPQ dataset. The result, 87.2%, further verifies the effectiveness of both the deep representation obtained from the ResNet model and the SVM model.
Taken together, these results show that using more kinds of features extracted from the image yields better performance. They also show that a deep neural network like ResNet-50 is better than VGG-16 and AlexNet at extracting generic deep features that contain aesthetics information.
4.4 Comparison with the State-of-the-Art
Table 3 shows the results of our proposed method on the AVA2 subset for image aesthetics categorization. Our method achieves the state-of-the-art result compared to other recently proposed methods. Specifically, by using the composite features generated by pretrained ResNet-50 models for image classification and scene recognition, we raise the accuracy on the AVA2 subset to 90.01%, which outperforms all the existing methods.
5 Conclusion
This paper presents an effective and efficient scheme in which composite features generated by pretrained deep convolutional neural networks increase the accuracy of image aesthetics assessment. Our proposed training-free method takes the local, global and scene-aware information of images into consideration and uses off-the-shelf deep learning models for feature extraction. Our experimental analysis demonstrates that our method achieves superior performance in comparison to other state-of-the-art approaches.
References
- [1] Y. Deng, C. C. Loy, and X. Tang, “Image Aesthetic Assessment: An experimental survey,” IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 80–106, July 2017.
- [2] N. Murray, L. Marchesotti, and F. Perronnin, “AVA: A large-scale database for aesthetic visual analysis,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, June 2012, pp. 2408–2415.
- [3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
- [4] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” arXiv:1409.1556 [cs], Sept. 2014.
- [5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 770–778.
- [6] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, “Places: A 10 million Image Database for Scene Recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PP, no. 99, pp. 1–1, 2017.
- [7] X. Lu, Z. Lin, H. Jin, J. Yang, and J. Z. Wang, “RAPID: Rating Pictorial Aesthetics using Deep Learning,” in Proceedings of the 22nd ACM International Conference on Multimedia, ACM, 2014, pp. 457–466.
- [8] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, June 2009, pp. 248–255.
