TL;DR
This paper introduces a novel listwise ranking model with refined view sampling for image cropping, significantly improving accuracy and speed over previous pairwise methods by better capturing view composition and reducing deformation effects.
Contribution
The paper proposes a listwise ranking approach combined with RoIRefine sampling to enhance image cropping performance, addressing key limitations of existing ranking-based methods.
Findings
Achieves state-of-the-art accuracy in image cropping
Improves speed compared to previous methods
Effectively models view composition with refined sampling
| Methods | Avg. Candidate | Avg. IoU | Avg. FPS |
|---|---|---|---|
| VFN+SW | 137 | 0.6328 | 0.77 |
| VFN+SW+ | 500 | 0.6395 | 0.22 |
| VFN+SW++ | 1125 | 0.6442 | 0.10 |
| LVRN (Ours) | 344 | 0.6773 | 197 |
| LVRN (Ours) | 919 | 0.6841 | 153 |
| LVRN (Ours) | 1745 | 0.7100 | 125 |
| Methods | Avg. Candidate | Avg. IoU | Avg. FPS |
|---|---|---|---|
| VFN+SW | 137 | 0.6328 | 0.77 |
| A2-RL | 13.56 | 0.6633 | 4.08 |
| VPN | 895 | 0.6641 | 75 |
| LVRN (Ours) | 1745 | 0.7100 | 125 |
| Ranking Loss | Avg. IoU |
|---|---|
| Pairwise + w/o selection (2.6M+ pairs) | 0.5355 |
| Pairwise + simple selection (1.3M+ pairs) | 0.5568 |
| Pairwise + careful selection | 0.6080 |
| Listwise | 0.6204 |
| RoI Operation | Avg. IoU(Pairwise) | Avg. IoU(Listwise) |
|---|---|---|
| w/o | 0.6080 | 0.6204 |
| RoIPool | 0.6526 | 0.6706 |
| RoIAlign | 0.6709 | 0.6956 |
| RoIWarp | 0.6732 | 0.6997 |
| RoIRefine | 0.6882 | 0.7100 |
Listwise View Ranking for Image Cropping
Weirui Lu1
Xiaofen Xing1
Bolun Cai1
Xiangmin Xu1
1South China University of Technology, China
{luweirui1022, caibolun}@gmail.com, {xmxu,xfxing}@scut.edu.cn
Abstract
Rank-based learning with deep neural networks has been widely used for image cropping. However, the performance of ranking-based methods is often poor, mainly for two reasons: 1) image cropping is a listwise ranking task rather than a pairwise comparison; 2) the rescaling caused by pooling layers and the deformation in view generation damage composition learning. In this paper, we develop a novel model to overcome these problems. To address the first problem, we formulate image cropping as a listwise ranking problem to find the best view composition. For the second problem, a refined view sampling (called RoIRefine) is proposed to extract refined feature maps for candidate view generation. Given a series of candidate views, the proposed model learns the Top-1 probability distribution of views and picks the best one. By integrating refined sampling and listwise ranking, the proposed network, called LVRN, achieves state-of-the-art performance in both accuracy and speed.
1 Introduction
Image cropping is a common photo manipulation process that improves the overall composition by removing unwanted regions. It is widely used in photography, film processing, graphic design, and printing. Recent methods tend to learn photo composition and extract well-composed regions from ill-composed photos.
With the development of deep learning, most researchers have devoted their efforts to deep networks based on a ranking approach. For ranking-based training, a number of candidate views in each image are labelled with an aesthetic ordering. The image cropping task is then formalized as classifying view pairs into two categories (correctly ranked and incorrectly ranked). Finally, a sliding window Chen et al. (2017b), a detector Wang and Shen (2017) or a reinforcement learner Li et al. (2018) is adopted to find the best view.
In Chen et al. (2017a), Chen et al. first investigated learning-to-rank methods for image cropping. The view finding network (VFN) Chen et al. (2017b), based on a pairwise ranking layer, was proposed to model photo composition and crop images by sliding window. Wei et al. trained a view evaluation network (VEN) Wei et al. (2018) with a pairwise siamese architecture. Inspired by knowledge distillation Hinton et al. (2015) and anchor boxes Liu et al. (2016), the view proposal network (VPN) was proposed to transfer knowledge from VEN. In Li et al. (2018), reinforcement learning is adopted to crop the image step by step, with each step controlled by the aesthetic score generated by a pre-trained VFN.
However, these ranking-based cropping methods often perform poorly, mainly for two reasons:
First, pairwise training is unsuitable for image cropping. In image cropping, the main goal is to pick the best-composed view from a list of candidate views. That is, image cropping is a listwise ranking task rather than a pairwise comparison. In addition, pairwise training depends heavily on careful pair selection, because samples with differing distributions result in training bias. Pairwise training therefore significantly increases the computational complexity and makes the training procedure unstable.
Second, coarse features extracted from a convolutional neural network (CNN) reduce the accuracy of model learning. Previous methods crop and warp the views in raw images or feature maps, and then calculate the rank score for each one. Pixel accuracy matters more in image cropping than in object classification, and it is reduced by the warp operation. In addition, the rescaling caused by pooling layers reduces the sampling resolution and damages composition learning.
To overcome these problems, we propose a listwise ranking method with refined view sampling. In refined view sampling, a novel region of interest (RoI) operation called RoIRefine is proposed to extract refined feature maps of candidate views. Instead of carefully selecting view pairs, we take advantage of all annotated views and train the model with a listwise ranking loss.
In summary, our main contributions are:
- We learn a deep network for image cropping with listwise ranking.
- We propose a refined view sampling named RoIRefine to alleviate the problem of rescaling and distortion.
- The proposed model significantly outperforms the state-of-the-art methods in both accuracy and speed.
2 Related Work
2.0.1 Image Cropping
Image cropping is a common operation in image editing, which aims to find views with good photo composition. Many methods have been proposed to automate this task. Previous cropping methods can, in general, be divided into attention-based and aesthetic-based approaches. The attention-based methods focus on finding the most visually important area in the original image. For example, Marchesotti et al. (2009) trained a simple classifier on an annotated image database to generate attention maps. In Ciocca et al. (2007), visual saliency information and face and skin color detection results are combined to place the bounding box for image cropping. The aesthetic-based methods emphasize the general attractiveness of the cropped image. Fang et al. (2014) proposed an aesthetic photo cropping system that combines three models: visual composition, boundary simplicity and content preservation. A set of aesthetic quality classifiers was trained to discriminate the quality of candidate windows Wang and Shen (2017). With the development of datasets labelled with comparative aesthetic scores, ranking-based methods were adopted to grade the composition of candidate windows Kong et al. (2016); Chen et al. (2017a). Recently, ranking-based methods combined with other frameworks (e.g. knowledge transfer Wei et al. (2018) and reinforcement learning Li et al. (2018)) have achieved state-of-the-art performance.
2.0.2 Learning to Rank
Ranking is widely used in information retrieval Yao et al. (2016), recommender systems Li et al. (2016) and software engineering Xuan and Monperrus (2014). In a learning-to-rank task, the training data consists of lists of items with some partial order specified between the items in each list. Most ranking algorithms fall into three groups according to their input representation and loss function: the pointwise, pairwise, and listwise approaches Liu and others (2009).
Pointwise approaches assume that each item in the training data has a numerical or ordinal score. The learning-to-rank problem can then be approximated by a regression problem, and ordinal regression and classification algorithms can be used to predict the score of a single item. For example, the perceptron ranking (PRank) algorithm was proposed to find a rank-prediction rule that assigns each instance a rank order Crammer and Singer (2002).
Pairwise approaches formalize the learning task as classifying object pairs into two categories (correctly and incorrectly ranked). RankNet Burges et al. (2005) learned a ranking rule by gradient descent with a natural probabilistic cost function on pairs of examples. RankBoost Freund et al. (2003) used boosting to train a ranking model by minimizing classification errors on instance pairs.
Listwise approaches try to directly optimize the value over all items in the training data. ListNet Cao et al. (2007) defined a listwise loss function for learning to rank and introduced two probability models, referred to as permutation probability and Top-1 probability. Suppose that $\pi$ is a permutation on the $n$ objects, and $\phi(\cdot)$ is an increasing and strictly positive function. Then, given the list of scores $s = (s_1, \dots, s_n)$, ListNet defines the probability of permutation $\pi$ as
$$P_s(\pi) = \prod_{j=1}^{n} \frac{\phi(s_{\pi(j)})}{\sum_{k=j}^{n} \phi(s_{\pi(k)})}, \tag{1}$$
and the top one probability of object $j$ is defined as
$$P_s(j) = \sum_{\pi(1) = j,\ \pi \in \Omega_n} P_s(\pi), \tag{2}$$
where $\Omega_n$ is the set of all permutations of the $n$ objects and $\pi(1) = j$ means that object $j$ is ranked on top one in permutation $\pi$. Thus, from Eq. (1) and (2), we can obtain
$$P_s(j) = \frac{\phi(s_j)}{\sum_{k=1}^{n} \phi(s_k)}, \tag{3}$$
where $s_j$ is the score of object $j$.
In general, with the use of top one probability, cross entropy is used to represent the distance between the two given score lists.
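The relation between permutation probability and Top-1 probability can be checked numerically. The sketch below is only an illustration, assuming the exponential as the increasing positive function (the choice made in ListNet); the function names and the example scores are hypothetical. It sums the permutation probabilities of all permutations that rank a given object first and compares the result with the closed form.

```python
import itertools
import math


def permutation_probability(scores, perm, phi=math.exp):
    """Probability of one permutation: product over positions of
    phi(score at position j) / sum of phi over positions j..n."""
    prob = 1.0
    for j in range(len(perm)):
        denom = sum(phi(scores[perm[k]]) for k in range(j, len(perm)))
        prob *= phi(scores[perm[j]]) / denom
    return prob


def top1_probability_bruteforce(scores, j, phi=math.exp):
    """Sum the permutation probabilities of every permutation ranking j first."""
    n = len(scores)
    return sum(
        permutation_probability(scores, perm, phi)
        for perm in itertools.permutations(range(n))
        if perm[0] == j
    )


def top1_probability_closed_form(scores, j, phi=math.exp):
    """Closed form: phi(s_j) divided by the sum of phi over all objects."""
    return phi(scores[j]) / sum(phi(s) for s in scores)


# Hypothetical scores for four objects.
scores = [0.2, 1.5, -0.3, 0.9]
for j in range(len(scores)):
    brute = top1_probability_bruteforce(scores, j)
    closed = top1_probability_closed_form(scores, j)
    print(j, round(brute, 6), round(closed, 6))
```

The brute-force sum costs a factorial number of terms, which is why the closed form matters in practice: it reduces the Top-1 distribution to a single normalization over the list.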
3 The Proposed Approach
In this paper, we propose a listwise view ranking network (LVRN) for image cropping. As illustrated in Figure 1, a refined view sampling (called RoIRefine) extracts high-resolution features to rank candidate views with listwise loss.
3.1 Listwise View Ranking
To address the shortcomings of pairwise approaches, we formulate composition learning as a listwise ranking problem. The proposed model ranks the candidate views as a list and picks the best one.
Given a set of annotated images $\mathcal{I} = \{I_1, \dots, I_N\}$, each image $I_i$ consists of a list of candidate views $V_i = \{v_{i,1}, \dots, v_{i,M}\}$, where $N$ is the number of images and $M$ is the number of views. For each view $v_{i,j}$ in the $i$-th image, a rank score $y_{i,j}$ is labelled to represent the relative degree of view composition. For instance, the number of views $M$ is 24 in the CPC dataset, which is labelled with a listwise protocol. In the view ranking network, we denote the rank function as $f$, which takes a view $v_{i,j}$ (sampled from image $I_i$) as input and outputs a rank score $s_{i,j} = f(v_{i,j})$. For the $i$-th image, we obtain a list of scores $s_i = (s_{i,1}, \dots, s_{i,M})$ from the list of views $V_i$. Therefore, the ranking function $f$ can be optimized by minimizing the loss between $s_i$ and the ground-truth scores $y_i = (y_{i,1}, \dots, y_{i,M})$.
Instead of carefully selecting training pairs as pairwise approaches do, listwise learning removes the training bias because all candidate views are seen in each iteration. Even so, some bias remains within the list: the best-composed view is more important than the worse ones. To address this problem, a nonlinear transformation is adopted to amplify the effect of the best view. We define $\phi$ as a common increasing function:
$$\phi(x) = \exp(x). \tag{4}$$
According to Eq. (3), we rewrite the output scores as Top-1 probabilities:
$$P(s_{i,j}) = \frac{\exp(s_{i,j})}{\sum_{k=1}^{M} \exp(s_{i,k})}. \tag{5}$$
Similarly, the ground-truth score $y_{i,j}$ is rewritten as $P(y_{i,j})$. Following Cao et al. (2007), we employ cross entropy as the metric to minimize the distance between the output probability $P(s_{i,j})$ and the ground-truth probability $P(y_{i,j})$. The loss function is defined as
$$L = -\sum_{i=1}^{N} \sum_{j=1}^{M} P(y_{i,j}) \log P(s_{i,j}). \tag{6}$$
The ranking function $f$ is found by minimizing the loss function $L$. Once the ranking function is learned, we simply use it to calculate the rank scores and crop the images from the candidate views.
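The listwise objective described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the exponential as the increasing function (so the Top-1 probability is a softmax over each list), averages the per-image cross entropy over the batch (which differs from a plain sum only by a constant factor), and uses hypothetical function names and random data.

```python
import numpy as np


def log_top1_probability(scores):
    """Log of the Top-1 probability with phi = exp, i.e. a log-softmax per list.

    scores: array of shape (num_images, num_views).
    """
    shifted = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def listwise_loss(pred_scores, gt_scores):
    """Cross entropy between ground-truth and predicted Top-1 distributions,
    averaged over images (the averaging is an assumption of this sketch)."""
    p_gt = np.exp(log_top1_probability(gt_scores))
    log_p_pred = log_top1_probability(pred_scores)
    return float(-(p_gt * log_p_pred).sum(axis=-1).mean())


# Hypothetical batch: 2 images, 24 candidate views each.
rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 5.0, size=(2, 24))
print(listwise_loss(gt, gt))    # lower bound: entropy of the ground-truth distribution
print(listwise_loss(-gt, gt))   # a reversed ranking gives a larger loss
```

Because the softmax concentrates mass on the highest scores, errors on the best-composed views dominate the loss, which matches the stated goal of amplifying the effect of the best view.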
3.2 Refined View Sampling
Coarse features extracted from the CNN backbone limit the performance of image cropping. Previous methods generate candidate views from images and then warp them to a fixed size (e.g., as in VFN). However, warping is not suitable for composition learning because it deforms the view. The deformation of features seriously damages common composition rules, such as the golden ratio, the golden spiral and the rule of thirds. In addition, the rescaling and repeated down-sampling in the CNN backbone make the model insensitive to view contents.
For view generation, there are three common RoI-aware operations, shown in Figure 2:
- RoIPool Girshick (2015) (Figure 2(a)) is a standard operation for extracting a small feature map from each RoI. The quantized RoI is subdivided into spatial bins, and the feature values covered by each bin are aggregated. The quantization introduces misalignment between the RoI and the extracted features.
- RoIAlign He et al. (2017) (Figure 2(b)) removes the harsh quantization of RoIPool, properly aligning the extracted features. In each RoI, RoIAlign uses bilinear interpolation to compute the exact features at four regularly sampled locations, and aggregates the RoI features using max or average pooling.
- RoIWarp (Figure 2(c)) was proposed in Dai et al. (2016). Unlike RoIAlign, RoIWarp crops a feature map region and warps it to a target size by interpolation. Even though RoIWarp also adopts bilinear resampling, it overlooks the alignment of floating-point RoI coordinates.
These RoI-aware operations are widely used in object detection and instance segmentation, but they are unsuitable for image cropping. Inspired by RoIAlign and RoIWarp, we propose an RoIRefine layer, shown in Figure 2(d), to extract high-quality features with less deformation. The proposed change is simple: we upsample the full-map features and then resample the RoI-aware features. The first bilinear interpolation improves the sampling resolution. Although it adds no new information, the higher resolution makes the features sensitive to floating-point RoI coordinates. Without the first interpolation, float coordinates cannot be represented in the feature map, so candidate boxes are shifted or rescaled. In other words, the interpolation implements finer sampling at float precision. The second resampling avoids inconsistency between the feature maps and the candidate views. In this paper, we simply upsample the full-map features and resample the RoI-aware features to a fixed size. Considering the trade-off between performance and efficiency, this upsampling setting is the best choice, as larger-scale upsampling (4x or 8x) hardly improves performance. Compared to previous RoI-aware operations, RoIRefine leads to large improvements, as shown in Section 4.3.2.
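The two-stage sampling described above (interpolate the full map, then resample the RoI at float coordinates) can be sketched with plain NumPy. This is a rough illustration under stated assumptions, not the authors' layer: the upsampling factor `up_factor=2`, the output size `out_size=8`, the normalized RoI format and the corner-aligned sampling grid are all hypothetical choices, and `bilinear_sample` and `roi_refine` are names introduced here.

```python
import numpy as np


def bilinear_sample(feat, ys, xs):
    """Bilinearly sample feat (H, W, C) on the grid ys x xs of float coordinates.

    Returns an array of shape (len(ys), len(xs), C).
    """
    h, w = feat.shape[:2]
    ys = np.clip(ys, 0, h - 1)
    xs = np.clip(xs, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]   # vertical interpolation weights
    wx = (xs - x0)[None, :, None]   # horizontal interpolation weights
    top = feat[y0][:, x0] * (1 - wx) + feat[y0][:, x1] * wx
    bottom = feat[y1][:, x0] * (1 - wx) + feat[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy


def roi_refine(feat, roi, up_factor=2, out_size=8):
    """Sketch of two-stage RoI sampling.

    roi = (x_min, y_min, x_max, y_max) in normalized [0, 1] image coordinates.
    """
    h, w = feat.shape[:2]
    # Stage 1: bilinear upsampling of the full feature map.
    up_h, up_w = h * up_factor, w * up_factor
    up = bilinear_sample(feat, np.linspace(0, h - 1, up_h), np.linspace(0, w - 1, up_w))
    # Stage 2: bilinear resampling of the RoI at float coordinates, fixed output size.
    x_min, y_min, x_max, y_max = roi
    ys = np.linspace(y_min * (up_h - 1), y_max * (up_h - 1), out_size)
    xs = np.linspace(x_min * (up_w - 1), x_max * (up_w - 1), out_size)
    return bilinear_sample(up, ys, xs)


# Hypothetical 7x7 feature map with 3 channels and one candidate view.
feature_map = np.random.default_rng(0).normal(size=(7, 7, 3))
view_features = roi_refine(feature_map, roi=(0.1, 0.2, 0.85, 0.9))
print(view_features.shape)
```

Every candidate view is mapped to the same output size without rounding its box to integer feature cells, which is the property the text attributes to the first interpolation.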
3.3 Implementation
In this paper, we initialize the backbone CNN with VGG16 pre-trained on ImageNet. All weights of the three FC layers are initialized from a normal distribution (zero mean and 0.01 standard deviation), the biases are set to zero, and the channels are set to 1024, 512 and 1, respectively. The proposed model is trained on the CPC dataset Wei et al. (2018), which includes 10,797 images, each with 24 candidate views. We directly rank the 24 views and assign the order of the views as the ground-truth rank score.
During training, the images are resized to 224 × 224 regardless of their original size. Resizing the original image to a fixed size fits the VGG16 pre-trained on 224 × 224 images (ImageNet) and benefits model fine-tuning. Although resizing the original image does cause global deformation, its effect is weak. Global deformation does not affect listwise ranking because the ranked objects are views rather than original images. All candidate views in one list share the same global deformation, and the listwise ranking loss is not sensitive to it.
We trained the network for 10 epochs using stochastic gradient descent (SGD) with a momentum of 0.9 and a learning rate of 0.001 that decays by 0.1 after 4 epochs. The batch size is set to 50, which means each mini-batch includes the candidate views cropped from 50 images. Early stopping was adopted based on validation results on the FCDB dataset Chen et al. (2017a).
4 Experiments
We validate the effectiveness of the proposed model on two public image cropping databases (FCDB Chen et al. (2017a) and FLMS Fang et al. (2014)). We also compare the time efficiency on a GPU to existing image cropping models in Table 4.
4.1 Experimental Settings
To evaluate our model, we use the sliding window strategy of Chen et al. (2017b) to generate candidate views and choose the view with the best rank score. The scale of the search windows is chosen from [0.6, 0.65, 0.7, …, 0.9] and the aspect ratio from [1:1, 3:4, 4:3, 9:16, 16:9]. To refine the candidate views, we adopt non-maximum suppression (NMS) based on the overlap ratio between candidate views and the original image to generate 1,745 candidate boxes.
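A sliding-window generator over these scales and aspect ratios might look like the sketch below. Only the scale list and the aspect-ratio list come from the text; how a scale and ratio map to a window size, the reading of each ratio as width:height, the `steps` grid of positions, and the function name are assumptions made for illustration. The NMS step is omitted, so the count produced here is not expected to match the 1,745 boxes reported.

```python
def sliding_window_candidates(img_w, img_h, scales, aspect_ratios, steps=5):
    """Enumerate candidate boxes (x0, y0, x1, y1) in pixel coordinates.

    For each scale s and aspect ratio (rw, rh), the window is the largest box
    with that ratio fitting inside s*img_w by s*img_h (an assumption), slid
    over a steps x steps grid of positions.
    """
    boxes = []
    for s in scales:
        for rw, rh in aspect_ratios:
            win_w = min(s * img_w, s * img_h * rw / rh)
            win_h = win_w * rh / rw
            for iy in range(steps):
                for ix in range(steps):
                    x0 = (img_w - win_w) * ix / (steps - 1)
                    y0 = (img_h - win_h) * iy / (steps - 1)
                    boxes.append((x0, y0, x0 + win_w, y0 + win_h))
    return boxes


scales = [0.6 + 0.05 * i for i in range(7)]            # 0.6, 0.65, ..., 0.9
aspect_ratios = [(1, 1), (3, 4), (4, 3), (9, 16), (16, 9)]
candidates = sliding_window_candidates(640, 480, scales, aspect_ratios)
print(len(candidates))  # 7 scales x 5 ratios x 25 positions
```

Each box would then be scored by the learned ranking function, and the highest-scoring one returned as the crop.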
4.1.1 FCDB Dataset
FCDB contains 348 test images, and each image is labelled by a photography hobbyist. To evaluate the generalization ability of our model, we adopt the same metrics as previous works Chen et al. (2017a); Wei et al. (2018), including intersection-over-union (IoU) and boundary displacement (Disp). The IoU is computed as
$$\mathrm{IoU} = \frac{A^{gt} \cap A^{p}}{A^{gt} \cup A^{p}}, \tag{7}$$
where $A^{gt}$ and $A^{p}$ denote the area of the ground-truth and best-ranking crop view, respectively. Boundary displacement is given by
$$\mathrm{Disp} = \frac{1}{4} \sum_{k \in \{\text{left},\, \text{right},\, \text{top},\, \text{bottom}\}} \left| B_k^{gt} - B_k^{p} \right|, \tag{8}$$
where $B_k^{gt}$ and $B_k^{p}$ denote the four corresponding edges of the ground-truth and best-ranking crop view, respectively.
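Both metrics are straightforward to compute from box coordinates. The sketch below is an illustration with hypothetical function names; it assumes boxes in the form (x0, y0, x1, y1) and normalizes each edge displacement by the image width or height, a normalization that is assumed here rather than stated in the text.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def boundary_displacement(gt_box, pred_box, img_w, img_h):
    """Mean distance between the four corresponding edges, with horizontal
    offsets divided by img_w and vertical offsets by img_h (assumed)."""
    gx0, gy0, gx1, gy1 = gt_box
    px0, py0, px1, py1 = pred_box
    return (
        abs(gx0 - px0) / img_w
        + abs(gx1 - px1) / img_w
        + abs(gy0 - py0) / img_h
        + abs(gy1 - py1) / img_h
    ) / 4


# Hypothetical boxes in a 200 x 100 image.
gt_box = (0, 0, 100, 100)
pred_box = (50, 0, 150, 100)
print(iou(gt_box, pred_box))                              # overlap 5000 / union 15000
print(boundary_displacement(gt_box, pred_box, 200, 100))  # (0.25 + 0.25 + 0 + 0) / 4
```

Higher IoU and lower boundary displacement both indicate a predicted crop closer to the annotated one.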
4.1.2 FLMS Dataset
FLMS contains 500 test images, and each image has 10 annotations from 10 different people. The evaluation metric differs slightly because each image has more annotations than in FCDB. Following previous methods, Top-1 maximum IoU is chosen as the evaluation metric: we pick the top-ranked cropping view, compute its IoU with each ground-truth annotation, and report the maximum IoU as the final result.
4.2 Quantitative Evaluation
In this section, we compare the cropping accuracy of our model with the state-of-the-art methods on the FCDB and FLMS datasets. VFN uses the ground-truth window as one of the candidate views, which leads to a remarkable improvement, and VPN applies post-processing that discards small views to improve performance. For a fair comparison, the results (shown in Table 1 and Table 2) are evaluated without ground-truth windows and post-processing, as in Li et al. (2018).
4.2.1 FCDB Dataset
Table 1 reports the cropping performance on the FCDB dataset. Besides the methods discussed above (VFN, A2-RL and VPN), we choose two other pairwise learning-to-rank methods as baselines. AesRankNet Kong et al. (2016) ranks photo aesthetics with a pairwise loss function. RankSVM Chen et al. (2017a) uses AlexNet to extract aesthetic features and finds the best cropping window among candidate views. According to Table 1, the proposed model achieves the best IoU and Disp scores among all compared methods.
4.2.2 FLMS Dataset
We also evaluate on the FLMS dataset, and the results are shown in Table 2. Following Li et al. (2018), we choose Top-1 maximum IoU (Max IoU) as the metric for cropping accuracy. In addition to ranking-based methods, two classification-based methods are also compared on FLMS. Fang et al. (2014) learn an aesthetic-based cropping model by discriminative classifier training. In Wang and Shen (2017), an attention box prediction (ABP) network and an aesthetics assessment (AA) network are proposed to model the photo assessment problem as aesthetic quality classification. The experiments in Table 2 show that our model outperforms the other methods in cropping accuracy.
4.2.3 Time Efficiency
To validate the time efficiency, we compare the time cost of our model with that of the state-of-the-art methods (VFN, A2-RL and VPN) on the FCDB dataset. All the results in Table 4 are evaluated on the same platform with one NVIDIA GeForce 1080 GPU.
The selection of candidate views plays an important role in image cropping. In Table 4, Candidate means the number of bounding boxes used to find the best view (in VFN and VPN) or to extract the evaluation feature (in A2-RL). In general, a model that evaluates more candidate views is more likely to find the best result, as shown in Table 3. As Table 4 shows, the proposed method evaluates the most candidate views in the least time (120+ frames per second) and achieves the best accuracy.
4.3 Performance Analysis
4.3.1 Performance of Listwise Learning
To illustrate the effectiveness of listwise learning, we design a contrast experiment in this section. As described in Section 3, we build VGG16-based networks without RoIRefine using different ranking losses (pairwise and listwise). We train the networks on the CPC dataset and compute the rank scores for all candidate views at once. Following Chen et al. (2017b); Wei et al. (2018), the pairwise loss is defined as
$$L(v_p, v_q) = \max\{0,\ 1 + f(v_q) - f(v_p)\}, \tag{9}$$
where $v_p$ and $v_q$ are two views selected from the same image, and $v_p$ is preferred over $v_q$. For listwise training, the training setting is the same as in Section 4.3.2 except that RoIRefine is not used.
The result of pairwise training depends heavily on pair selection, because samples with differing distributions result in training bias. To make the comparison as detailed as possible, we train three models with the different selection methods shown in Table 5. Without pair selection, there are more than 2.6 million pairs in the CPC dataset. With simple pair selection, we set a threshold of 0.5 to drop the pairs with a minor gap in rank score, which leaves 1.3 million pairs. With careful pair selection, we train the model following Wei et al. (2018). The marked improvement in Table 5 shows that listwise training overcomes the problem of pairwise training.
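The simple pair selection and a pairwise objective can be sketched as follows. This is an illustrative sketch only: the 0.5 gap threshold comes from the text, while the hinge form with a margin of 1, the function names, and the example scores are assumptions.

```python
import itertools


def select_pairs(gt_scores, min_gap=0.0):
    """Return (preferred, other) index pairs whose ground-truth score gap
    exceeds min_gap; min_gap=0.0 keeps every pair with distinct scores."""
    pairs = []
    for a, b in itertools.combinations(range(len(gt_scores)), 2):
        gap = gt_scores[a] - gt_scores[b]
        if abs(gap) > min_gap:
            pairs.append((a, b) if gap > 0 else (b, a))
    return pairs


def pairwise_hinge_loss(pred_scores, pairs, margin=1.0):
    """Mean hinge loss: each preferred view should outscore the other by margin."""
    if not pairs:
        return 0.0
    total = sum(max(0.0, margin + pred_scores[q] - pred_scores[p]) for p, q in pairs)
    return total / len(pairs)


# Hypothetical ground-truth scores for four views of one image.
gt_scores = [3.0, 2.8, 1.0, 0.9]
all_pairs = select_pairs(gt_scores)               # every pair is kept
kept_pairs = select_pairs(gt_scores, min_gap=0.5)  # near-ties are dropped
print(len(all_pairs), len(kept_pairs))
print(pairwise_hinge_loss([0.0, 0.0, 0.0, 0.0], kept_pairs))
```

The number of pairs grows quadratically with the list length, whereas a listwise loss processes each list once, which is one way to see the computational argument made in the text.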
4.3.2 Performance of RoIRefine
In this section, we study the improvement brought by the proposed RoIRefine. To show the contribution of the listwise loss when combined with better view sampling (RoIRefine), we train and evaluate eight models using different ranking losses and different RoI-aware operations. The experimental results of RoIRefine and the three other RoI-aware operations on FCDB are shown in Table 6. These RoI-aware operations differ in the number and placement of interpolations, as shown in Figure 2. RoIPool aggregates the view feature after RoI-crop without any interpolation; RoIAlign aligns the feature using interpolation before RoI-crop; RoIWarp resamples the feature using interpolation after RoI-crop. Inspired by RoIAlign and RoIWarp, RoIRefine adopts bilinear interpolation both before and after RoI-crop.
RoI-aware operations reduce the deformation caused by traditional view generation and achieve a marked improvement of about 5.0% IoU. Without interpolation, RoIPool causes feature deformation and achieves the worst performance of the four RoI-aware operations. RoIAlign and RoIWarp remove the harsh quantization of RoI boundaries and improve IoU by about 2.0% to 2.5% over RoIPool. RoIRefine combines pre-interpolation (RoIAlign) and post-interpolation (RoIWarp) to refine the view feature, and achieves a gain of about 1.5% IoU over RoIAlign and 1.0% IoU over RoIWarp. The experimental results demonstrate that the high-quality features extracted by RoIRefine can overcome the problems of rescaling and deformation.
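The approximate gains quoted above can be recomputed from the RoI-operation ablation table. The short script below is only an arithmetic check; the IoU values are copied from that table, and the helper name `gain` is introduced here for illustration.

```python
# Avg. IoU on FCDB from the RoI-operation ablation table: (pairwise, listwise).
table = {
    "w/o": (0.6080, 0.6204),
    "RoIPool": (0.6526, 0.6706),
    "RoIAlign": (0.6709, 0.6956),
    "RoIWarp": (0.6732, 0.6997),
    "RoIRefine": (0.6882, 0.7100),
}


def gain(better, baseline, column):
    """IoU difference in percentage points; column 0 = pairwise, 1 = listwise."""
    return round(100 * (table[better][column] - table[baseline][column]), 2)


comparisons = [
    ("RoIPool", "w/o"),
    ("RoIAlign", "RoIPool"),
    ("RoIWarp", "RoIPool"),
    ("RoIRefine", "RoIAlign"),
    ("RoIRefine", "RoIWarp"),
]
for better, baseline in comparisons:
    pairwise = gain(better, baseline, 0)
    listwise = gain(better, baseline, 1)
    print(f"{better} vs {baseline}: pairwise +{pairwise}, listwise +{listwise}")
```

Reading the two columns side by side also shows that the listwise model stays ahead of the pairwise model under every RoI operation.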
4.4 Qualitative Visualization
Figure 3 shows five groups of qualitative results generated by different methods on the FCDB dataset. The comparison shows that our model extracts better views than the other methods.
For A2-RL (Figure 3(b)), reinforcement learning is sensitive to the initial state and the number of iteration steps, resulting in the unstable performance shown in the second and fifth images. VPN (Figure 3(c)) uses 895 anchor boxes, including the original image, and tends to select the full image, as shown in the first two images. Because of its high computational complexity, VFN cannot evaluate a large number of candidate views to achieve high-accuracy results, as shown in Figure 3(d). Comparing Figure 3(a) and Figure 3(e), we can see that our predicted boxes are close to the ground truth. In the last column, the results cropped by our method have better visual quality than the original images.
5 Conclusions
Image cropping is a common photo manipulation process that improves the overall composition by removing unwanted regions. In this paper, we formulate the learning of photo composition as a listwise ranking problem to overcome the problems of pairwise approaches. Furthermore, a novel RoIRefine operation is proposed to extract high-quality features for view generation. The experimental results on two common datasets show that our method sets new state-of-the-art results at a speed of 120+ frames per second.
In future work, we will study multi-task learning to combine composition evaluation and box regression. How to design the multi-task loss remains an open issue. Inspired by the success of detection frameworks, an RCNN-like Girshick (2015) or SSD-like Liu et al. (2016) method will be our first choice.
References
- Burges et al. [2005] Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In Proceedings of the 22nd international conference on Machine learning, pages 89–96. ACM, 2005.
- Cao et al. [2007] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th international conference on Machine learning, pages 129–136. ACM, 2007.
- Chen et al. [2017a] Yi-Ling Chen, Tzu-Wei Huang, Kai-Han Chang, Yu-Chen Tsai, Hwann-Tzong Chen, and Bing-Yu Chen. Quantitative analysis of automatic image cropping algorithms: A dataset and comparative study. In Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on, pages 226–234. IEEE, 2017.
- Chen et al. [2017b] Yi Ling Chen, Jan Klopp, Min Sun, Shao Yi Chien, and Kwan Liu Ma. Learning to compose with professional photographs on the web. 2017.
- Ciocca et al. [2007] Gianluigi Ciocca, Claudio Cusano, Francesca Gasparini, and Raimondo Schettini. Self-adaptive image cropping for small displays. IEEE Transactions on Consumer Electronics, 53(4):1622–1627, 2007.
- Crammer and Singer [2002] Koby Crammer and Yoram Singer. Pranking with ranking. In Advances in neural information processing systems, pages 641–647, 2002.
- Dai et al. [2016] Jifeng Dai, Kaiming He, and Jian Sun. Instance-aware semantic segmentation via multi-task network cascades. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3150–3158, 2016.
- Fang et al. [2014] Chen Fang, Zhe Lin, Radomir Mech, and Xiaohui Shen. Automatic image cropping using visual composition, boundary simplicity and content preservation models. In Proceedings of the 22nd ACM international conference on Multimedia, pages 1105–1108. ACM, 2014.
