Style Transfer With Adaptation to the Central Objects of the Scene

Alexey Schekalev; Victor Kitov

arXiv:1906.01134·cs.CV·June 5, 2019

TL;DR

This paper introduces a style transfer method that detects central objects in images and applies style non-uniformly to preserve the recognizability of key objects like faces or text, improving visual quality.

Contribution

It proposes a novel style transfer algorithm with automatic central object detection and spatial importance masking, enhancing stylization quality over classical methods.

Findings

1. Higher quality stylization compared to classical methods
2. Three automatic central object detection methods evaluated
3. User study confirms improved preservation of key objects

Tables

Table 1: Comparison of the baseline algorithm with the proposed models.

Mask generation method	Percent of votes for the proposed method
Patches	66
Superpixel	72
Segmentation	80

Equations

$$\hat{x} = \arg\min_{x} \left\{ \alpha L_{content}(x, x_{c}) + L_{style}(x, x_{s}) \right\} \tag{1}$$

$$L_{content}(x, x_{c}) = \alpha \sum_{i,j} \left( F_{i,j}^{l} - P_{i,j}^{l} \right)^{2} \tag{2}$$

$$L^{'}_{content}(x, x_{c}) = \sum_{i,j} \alpha_{i,j} \left( F_{i,j}^{l} - P_{i,j}^{l} \right)^{2} \tag{3}$$

Full text

Style transfer with adaptation to the central objects of the scene

Alexey A. Schekalev (1), Victor V. Kitov (1, 2)

(1) Lomonosov Moscow State University
(2) Plekhanov Russian University of Economics

Abstract

Style transfer is the problem of rendering an image with given content in the style of another image, for example a family photo in the style of a painting by a famous artist. The drawback of the classical style transfer algorithm is that it imposes style uniformly on all parts of the content image, which perturbs central objects of the content image, such as faces or text, and makes them unrecognizable. This work proposes a novel style transfer algorithm which automatically detects central objects in the content image, generates a spatial importance mask and imposes style non-uniformly: central objects are stylized less to preserve their recognizability, and other parts of the image are stylized as usual to preserve the style. Three methods of automatic central object detection are proposed and evaluated qualitatively and via a user evaluation study. Both comparisons demonstrate higher quality of stylization compared to the classical style transfer method.

keywords:

computer vision, image processing, style transfer, image classification

1 Introduction

Image stylization [1] is a classical problem in computer vision: rendering a content image in the style of another, style image, as shown in Fig. 1. Earlier approaches used hard-coded rules to impose a predefined style. More recently, Gatys et al. [2] proposed a method that imposes an arbitrary style on an arbitrary content image using deep convolutional networks.

The main task is to transfer style from one image to another, and the algorithm should work with any content and style images. In 2016 Gatys et al. proposed a stylization method [2] based on deep neural networks that solves this problem. The main idea is to optimize in the space of images to find a picture that semantically reflects the content of the content image and the style of the style image. These two conflicting goals are balanced by simultaneously minimizing a content loss and a style loss:

$$\hat{x} = \arg\min_{x} \left\{ \alpha L_{content}(x, x_{c}) + L_{style}(x, x_{s}) \right\} \tag{1}$$

The coefficient $\alpha$ determines the strength of stylization (Fig. 2a). A lower $\alpha$ imposes more style, and a higher $\alpha$ preserves more content. The shortcoming of this approach is that style is imposed uniformly on the whole content image, distorting the important central objects of the image, which are critical for perception. For example, it is hard to tell what bird is sitting on the tree (Fig. 2b), because small details of the bird's silhouette were lost during stylization.

One may improve content preservation by increasing the coefficient $\alpha$ in (1). However, this decreases stylization strength globally and thus gives a less expressive stylization.

This paper proposes a new solution to this problem. First, central objects are detected using an automatically generated spatial importance mask for the content image. Next, this mask is used to impose style with spatially varying strength. This makes it possible to achieve two conflicting goals: stylization is gentle on the central objects of the image that are critical for perception, such as human faces, houses and cars, and strong in the rest of the image, thus expressing a vivid style.
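As a rough illustration of objective (1), the sketch below evaluates $\alpha L_{content} + L_{style}$ on precomputed feature maps with NumPy. It assumes a single network layer with features of shape (channels, height, width) and uses the Gram-matrix style loss of [2]. The function names and the normalization constant are illustrative choices, not taken from this paper.

```python
import numpy as np

def gram(features):
    """Gram matrix of a (C, H, W) feature map: channel-by-channel correlations."""
    c = features.shape[0]
    flat = features.reshape(c, -1)
    return flat @ flat.T

def content_loss(F, P):
    """Sum of squared differences between stylized (F) and content (P) features."""
    return np.sum((F - P) ** 2)

def style_loss(F, S):
    """Gram-matrix style loss between stylized (F) and style (S) features.
    The 1 / (4 C^2 (HW)^2) scaling is an assumed normalization."""
    c, h, w = F.shape
    return np.sum((gram(F) - gram(S)) ** 2) / (4.0 * c ** 2 * (h * w) ** 2)

def objective(F, P, S, alpha):
    """Value of alpha * L_content + L_style for one candidate image."""
    return alpha * content_loss(F, P) + style_loss(F, S)

# Toy features standing in for VGG activations (hypothetical data).
rng = np.random.default_rng(0)
F = rng.normal(size=(3, 4, 4))
P = rng.normal(size=(3, 4, 4))
S = rng.normal(size=(3, 4, 4))
low_alpha = objective(F, P, S, alpha=1.0)
high_alpha = objective(F, P, S, alpha=10.0)
```

Raising `alpha` increases the penalty on every content deviation at once, which is why a single global coefficient cannot protect central objects without weakening the style everywhere.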

The paper is organized as follows. Section 2 gives a description of the proposed method and provides qualitative comparisons with the baseline stylization method of Gatys et al. Section 3 provides the details of the user evaluation study and summarizes its results, highlighting the superiority of the proposed solution. Section 4 concludes.

2 Method

2.1 Non-uniform Stylization

Consider the loss function in the optimization problem (1). In the original paper [2] content loss is formalized as follows:

$$L_{content}(x, x_{c}) = \alpha \sum_{i,j} \left( F_{i,j}^{l} - P_{i,j}^{l} \right)^{2} \tag{2}$$

where $F^{l}$ and $P^{l}$ are the internal representations of the stylized image $x$ and the content image $x_{c}$, respectively, at layer $l$ of a pre-trained convolutional neural network [3], for which we use VGG [4]. Instead of a constant $\alpha$, we propose to use a matrix with a separate value $\alpha_{i,j}$ for each spatial location $(i,j)$:

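$$L^{'}_{content}(x, x_{c}) = \sum_{i,j} \alpha_{i,j} \left( F_{i,j}^{l} - P_{i,j}^{l} \right)^{2} \tag{3}$$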

Making $\alpha$ spatially variable allows us to impose less style on the central objects of the scene, which are critical for perception, and more style in all other areas of the image.
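A minimal NumPy sketch of the spatially weighted content loss (3) follows. It assumes feature maps of shape (channels, height, width), sums squared differences over channels at each location, and assumes the importance map has already been resized to the spatial resolution of the chosen layer. The function name is hypothetical.

```python
import numpy as np

def weighted_content_loss(F, P, alpha_map):
    """Content loss with a per-location weight alpha_map[i, j].

    F, P      : feature maps of shape (C, H, W) for the stylized and content image
    alpha_map : importance weights of shape (H, W)
    """
    squared_diff = np.sum((F - P) ** 2, axis=0)  # (H, W): summed over channels
    return np.sum(alpha_map * squared_diff)

# Toy example: the location with the largest weight dominates the loss.
F = np.zeros((1, 2, 2))
P = np.ones((1, 2, 2))
alpha_map = np.array([[1.0, 2.0],
                      [3.0, 4.0]])
loss = weighted_content_loss(F, P, alpha_map)
```

Setting every entry of `alpha_map` to the same constant recovers the uniformly weighted sum, so the constant-$\alpha$ case is a special case of this function.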

2.2 Automatic Central Objects Detection

Consider a convolutional neural network pre-trained for image classification; we use VGG [4]. Such a model takes an input image and outputs a probability distribution over the ImageNet classes. We detect central objects by filling different parts of the input image with a uniform color and measuring the change in the output class probabilities. If a key object of the image is filled, one observes a drastic change in class probabilities. Conversely, if background is filled, the class probabilities change only slightly. Thus the magnitude of the change in class probabilities determines the importance of the filled region. This approach has been used to visualize convolutional neural networks in classification problems [5], but to our knowledge this is its first use in style transfer. After splitting the whole image into a set of regions, filling each region one by one and evaluating its importance, we build a full importance map $\alpha_{i,j}$ that measures the semantic significance of each location of the image. This importance map is passed to the spatially varying style transfer algorithm (1) with the modified content loss function (3).

2.2.1 Patch-Based Mask Generation

In this approach we divide the image with a uniform patch grid (as in Fig. 3b). Sequentially overwriting the patches and passing the image through the neural network, we rate the importance of each patch by the $L_{2}$ norm of the difference between the class distributions. Visualization of the results shows that the proposed algorithm can find the central objects of the scene and separate them from the background (Fig. 4a). We then use the resulting $\alpha_{i,j}$ matrix in the stylization algorithm with the modified content loss (3). Fig. 4 (b and c) shows the difference between the baseline approach and the proposed model: many small details of the dog's face that are lost by the baseline approach are preserved by the new model.

In the example above (Fig. 4a) the main patch covers not only the central object but also part of the background. Instead of using a fixed grid, we additionally propose to run the previous algorithm for different positions of the grid and combine the results by pixelwise averaging (Fig. 5a). The difference between the two approaches is shown in Fig. 5 (b and c). Averaging the different matrices yields a smoother distribution of weights and therefore defines the boundaries of central objects better.
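The patch-based scoring can be sketched as follows. This is a minimal illustration under stated assumptions: `predict_proba` is a hypothetical callable that maps an image to a vector of class probabilities (in the paper this role is played by VGG), and the toy classifier, patch size and fill color below are placeholders chosen so the example runs without a trained network.

```python
import numpy as np

def patch_importance(image, predict_proba, patch, fill=0.0):
    """Fill each grid patch with a uniform color and score it by the
    L2 norm of the resulting change in class probabilities."""
    h, w = image.shape[:2]
    base = predict_proba(image)
    mask = np.zeros((h, w), dtype=np.float64)
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            occluded = image.copy()
            occluded[top:top + patch, left:left + patch] = fill
            change = predict_proba(occluded) - base
            mask[top:top + patch, left:left + patch] = np.linalg.norm(change)
    return mask

def toy_predict_proba(image):
    """Hypothetical stand-in for a classifier: the 'object' class
    probability equals the mean brightness of the image."""
    p = float(image.mean())
    return np.array([p, 1.0 - p])

# A bright 2x2 "object" on a dark 8x8 background.
image = np.zeros((8, 8))
image[2:4, 2:4] = 1.0
mask = patch_importance(image, toy_predict_proba, patch=2, fill=0.0)
```

In this toy setting only the patch covering the bright square changes the prediction, so it is the only patch with non-zero importance. With a real classifier every patch would receive some score, and each forward pass costs one network evaluation per patch.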
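The grid-shifting idea can be sketched as below: importance maps are computed for several grid offsets and averaged pixelwise. As before, `predict_proba` is a hypothetical classifier callable, the toy classifier stands in for VGG, and the particular shifts are illustrative; the paper does not state which offsets were used.

```python
import numpy as np

def grid_labels(h, w, patch, dy=0, dx=0):
    """Integer label map of a patch grid whose origin is shifted by (dy, dx),
    with 0 <= dy, dx < patch."""
    rows = (np.arange(h) + dy) // patch
    cols = (np.arange(w) + dx) // patch
    n_cols = (w - 1 + dx) // patch + 1
    return rows[:, None] * n_cols + cols[None, :]

def region_importance(image, labels, predict_proba, fill=0.0):
    """Fill each labelled region in turn and record the L2 change in probabilities."""
    base = predict_proba(image)
    mask = np.zeros(labels.shape, dtype=np.float64)
    for region_id in np.unique(labels):
        region = labels == region_id
        occluded = image.copy()
        occluded[region] = fill
        mask[region] = np.linalg.norm(predict_proba(occluded) - base)
    return mask

def averaged_importance(image, predict_proba, patch, shifts, fill=0.0):
    """Pixelwise average of the importance maps obtained for each grid shift."""
    h, w = image.shape[:2]
    masks = [region_importance(image, grid_labels(h, w, patch, dy, dx),
                               predict_proba, fill)
             for dy, dx in shifts]
    return np.mean(masks, axis=0)

def toy_predict_proba(image):
    """Hypothetical classifier: 'object' probability is the mean brightness."""
    p = float(image.mean())
    return np.array([p, 1.0 - p])

# The object straddles the borders of the unshifted 4x4 grid.
image = np.zeros((8, 8))
image[3:5, 3:5] = 1.0
avg_mask = averaged_importance(image, toy_predict_proba, patch=4,
                               shifts=[(0, 0), (2, 2)])
```

With the unshifted grid alone, the object is split across four patches and every pixel gets the same score. Adding the shifted grid concentrates weight around the object, which mirrors the smoother, better-localized masks described above.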

2.2.2 Superpixel-Based Mask Generation

In the example above (Fig. 5) averaging different $\alpha_{i,j}$ matrices produces a boundary of elliptical form. If central objects have more complicated boundaries, this method becomes unsuitable. To improve the results, instead of a uniform grid we suggest splitting the image into superpixels [6]. This algorithm divides the image into small segments (superpixels) whose boundaries are close to the boundaries of the objects in the image (Fig. 6a). The superpixel algorithm has two main parameters, which control the number of segments and the shape of their boundaries. We choose a set of predefined values of these parameters, run the importance mask evaluation algorithm once for each, and average the results for better quality (Fig. 6b).

Fig. 7 shows the qualitative difference between uniform stylization (a) and patch-based (b) and superpixel-based (c) spatially varying stylization. The boundary of the central object, the glass, is non-convex, so the superpixel-based method extracts it better, which improves the quality of the final stylization.

2.2.3 Segmentation-Based Mask Generation

Deep learning models are good at image segmentation tasks [7]. We can therefore evaluate the $\alpha_{i,j}$ matrix with the previous approaches and then correct its boundaries using the results of a segmentation algorithm. This approach increases the quality of stylization when the object is easy to separate from the background. The example in Fig. 8 shows that the stylization algorithm with segmentation locates the car exactly along its border, while the superpixel algorithm affects some pixels near the car, which makes the final style transfer less sharp along the border of the central object of the image.
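The text does not specify how the boundaries are corrected. One possible reading, shown here purely as an assumption, is to replace each pixel's importance with the mean importance of the segment it belongs to, so that the mask becomes constant inside every segment and changes only at segment borders. The function name and the toy segmentation are hypothetical.

```python
import numpy as np

def snap_to_segments(alpha_map, segments):
    """Replace each importance value by the mean over its segment.

    alpha_map : (H, W) importance map from a patch- or superpixel-based method
    segments  : (H, W) array of non-negative integer segment labels
    """
    labels = segments.ravel()
    sums = np.bincount(labels, weights=alpha_map.ravel())
    counts = np.bincount(labels)
    means = sums / np.maximum(counts, 1)  # guard against unused label ids
    return means[segments]

# A blurry importance blob that leaks past the object on the right side.
alpha_map = np.array([[0.0, 0.0, 0.0, 0.0],
                      [0.0, 1.0, 1.0, 0.5],
                      [0.0, 1.0, 1.0, 0.5],
                      [0.0, 0.0, 0.0, 0.0]])
segments = np.zeros((4, 4), dtype=int)
segments[1:3, 1:3] = 1  # label 1 marks the object, label 0 the background
refined = snap_to_segments(alpha_map, segments)
```

After this step the leaked importance to the right of the object is spread thinly over the whole background segment, while the object keeps a uniform high weight along its exact border.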

3 Results

To evaluate the proposed method quantitatively, we conducted a user evaluation study. For a representative set of content and style images, two stylizations were obtained: one by the baseline method of Gatys et al. and one by the proposed method. Respondents were asked to select the stylization they liked more. To avoid position bias, the baseline stylization and the stylization with the proposed method were shown in random order for each comparison. 6 respondents were surveyed on 29 stylization outputs. We conducted 3 surveys comparing the baseline stylization algorithm of Gatys et al. with our method, where the importance mask was generated using patches, superpixels and image segmentation results, respectively. Results are shown in Table 1.

From these results we see that our method outperforms the baseline stylization in all cases. The image segmentation modification gives the largest benefit, which can be attributed to the fact that it extracts the boundaries of central objects more accurately.

4 Conclusion

A new style transfer method with spatially varying strength was proposed in this work. Stylization strength is controlled for each pixel by an automatically generated importance mask. Three methods, namely patch-based, superpixel-based and segmentation-based, were proposed to generate the importance mask. Qualitative comparisons and the user evaluation study demonstrated the superiority of the proposed method over the classical style transfer method of Gatys et al., due to expressive style transfer for the background and gentler style transfer for the central objects of the content image. Among the three proposed importance mask generation methods, the segmentation-based one showed the highest quality, which may be attributed to more accurate boundary estimation of the central objects of the image.

Bibliography

[1] https://research.adobe.com/news/image-stylization-history-and-future/
[2] Gatys L., Ecker A., Bethge M. Image Style Transfer Using Convolutional Neural Networks // IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016. P. 2414-2423.
[3] Krizhevsky A., Sutskever I., Hinton G. ImageNet classification with deep convolutional neural networks // Advances in Neural Information Processing Systems. 2012. P. 1097-1105.
[4] Simonyan K., Zisserman A. Very deep convolutional networks for large-scale image recognition // arXiv preprint arXiv:1409.1556. 2014.
[5] Zeiler M., Fergus R. Visualizing and understanding convolutional networks // European Conference on Computer Vision. 2014. P. 818-833.
[6] https://www.pyimagesearch.com/2014/07/28/a-slic-superpixel-tutorial-using-python/
[7] Zhou B., Zhao H., Puig X., Xiao T., Fidler S., Barriuso A., Torralba A. Semantic understanding of scenes through the ADE20K dataset // International Journal of Computer Vision. 2018.