Son of Zorn's Lemma: Targeted Style Transfer Using Instance-aware Semantic Segmentation

Carlos Castillo, Soham De, Xintong Han, Bharat Singh, Abhay Kumar Yadav, and Tom Goldstein

arXiv:1701.02357·cs.CV·January 11, 2017

TL;DR

This paper introduces a targeted style transfer method that uses instance-aware semantic segmentation and Markov random fields to selectively stylize objects in images, enabling applications like augmented reality and cartoon rendering.

Contribution

It presents a novel approach combining segmentation and style transfer with boundary smoothing for selective stylization of image objects.

Findings

1. Effective object segmentation and stylization achieved
2. Smooth blending of stylized objects with surroundings
3. Applicable to augmented reality and artistic rendering

Equations

$$\min_{x}\ \underbrace{\|F(x)-F(t)\|^{2}}_{\text{structure}} + \underbrace{\|C(x)-C(s)\|^{2}}_{\text{style}}$$

$$L(\theta) = L_{b}(B(\theta)) + L_{m}(M(\theta) \mid B(\theta)) + L_{c}(C(\theta) \mid M(\theta), B(\theta))$$

$$U(p,l) = \|p - c^{l}\|.$$

$$B(p_{1},l_{1},p_{2},l_{2}) = |I_{l_{1}}(p_{1}) - I_{l_{2}}(p_{1})|^{2} + |I_{l_{2}}(p_{2}) - I_{l_{1}}(p_{2})|^{2}.$$

$$E(l) = \sum_{p} U(p,l_{p}) + \sum_{\{p,q\}\in\mathcal{N}} B(p,l_{p},q,l_{q}).$$

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Aesthetic Perception and Analysis · Computer Graphics and Visualization Techniques

Full text

Abstract

Style transfer is an important task in which the style of a source image is mapped onto a target image. The method is useful for synthesizing derivative works of a particular artist or specific painting. This work considers targeted style transfer, in which the style of a template image is used to alter only part of a target image. For example, an artist may wish to alter the style of only one particular object in a target image without altering the object’s general morphology or surroundings. This is useful, for example, in augmented reality applications (such as the recently released Pokémon GO), where one wants to alter the appearance of a single real-world object in an image frame to make it appear as a cartoon. Most notably, the rendering of real-world objects into cartoon characters has been used in a number of films and television shows, such as the upcoming series Son of Zorn. We present a method for targeted style transfer that simultaneously segments and stylizes single objects selected by the user. The method uses a Markov random field model to smooth and anti-alias outlier pixels near object boundaries, so that stylized objects naturally blend into their surroundings.

**Index Terms:** Style transfer, instance-aware semantic segmentation, convolutional neural network, Markov random fields, image filtering

1 Introduction

Style transfer is an important task in computer graphics in which the style (line strokes, textures, and colors) of a source image is mapped onto a target image. Automated style transfer software facilitates the conversion of real-world images into the appropriate style to form the background in cartoons, simulations, and other renderings. The method is also useful for generating derivative works of a particular artist or painting. The concept of style transfer is behind popular apps like Prisma, which convert real-world photos into different artistic styles.

In this paper, we propose targeted style transfer, in which the style of a template image is used to alter only part of a target image. For example, an artist may wish to alter the style of only one particular object in a target image without altering the object’s general morphology or surroundings. This is useful, for example, in augmented reality applications (such as the recently released Pokémon GO), where one wants to alter the appearance of a single real-world object in an image frame to make it appear as a cartoon. Most notably, the rendering of real-world objects into cartoon characters has been used in a number of films and television shows, such as the upcoming series Son of Zorn (see Fig. 1).

We present a method for targeted style transfer that simultaneously segments and stylizes a single object selected by the user. The method performs the object transformation using deep network-based image modification hybridized with semantic segmentation. The method integrates a Markov random field model to smooth and anti-alias outlier pixels near object boundaries, so that stylized objects naturally blend into their surroundings without visible seams.

1.1 Related Work

Whole-image style transfer has been studied by a number of authors. Style transfer was first proposed in [1, 2], in which a deep convolutional network was used to transduce the style of a source image onto the target. The transformed image is recovered by minimizing an energy functional with two terms. The first term measures the semantic similarity between the target image and generated image, as quantified by the 2-norm difference between the deep features of each image. The second term measures the texture similarity between the generated image and the source image. Texture information is extracted from each image using covariance matrices that capture the correlations between deep features within an image. This method of texture extraction was first proposed in [3] for texture synthesis.

Several authors have proposed improvements to the style transfer model, although all have been focused on whole images. In [4], the authors modify the transfer algorithm to be color preserving. This is accomplished by modifying the source image to have a color profile similar to the target before performing the style mapping. The authors of [5] speed up style transfer by using a simplified “perceptual loss function” to compute the similarity between images. In [6], the authors present a model for data-driven image synthesis that, given an image, automatically creates a variant that looks similar but differs in structure. The model uses a combination of generative Markov random fields and deep convolutional neural networks (dCNN) for synthesizing the images.

Since this paper solves the problem of style transfer for a targeted object, our approach needs to generate a mask for each object in the target image. Therefore, it is also related to object detection [7, 8, 9] and semantic segmentation [10, 11]. Faster R-CNN [8] introduces a Region Proposal Network that predicts object boxes and objectness scores at the same time with an almost cost-free region proposal process. DeepMask [10] jointly trains a neural network with two objectives: predicting a class-agnostic segmentation mask for an image patch, and scoring how likely the patch is to be centered on a full object. The authors of [11] propose a cascaded network with three stages, which predict box instances, mask instances, and categorized instances in an end-to-end multi-task framework. In this paper, we use the method in [11] because it provides accurate mask instances for objects.

2 Our Approach

In this paper, we introduce a new pipeline for performing style transfer only on parts of images. The pipeline of the basic algorithm is shown in Fig. 2:

1. We first map the style of the source image onto the whole target image using the style transfer algorithm as described in [1].
2. A semantic segmentation algorithm [11] identifies different regions in the target image, and the user selects the regions onto which transfer will occur. The user may select specific objects in images, for example a specific person or a group of people in an image, to accept the style transfer.
3. The target object is segmented from the style-transferred image, and a Markov random field (MRF) based model is used to merge the extracted stylized object with the non-stylized background.

Note that a naive style transfer could be done by segmenting the stylized object and placing it into the non-stylized background without the MRF model. The naive transfer yields a crude preliminary result, but the solution often looks out-of-place. The MRF model described below produces a more appealing embedding of the stylized object into the background. To the best of our knowledge, this is the first paper to study a pipeline containing segmentation, style transfer, and image fusion. We describe each step below.
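The naive transfer mentioned above amounts to a per-pixel choice between the stylized image and the original image. A minimal sketch in plain Python follows, assuming grayscale images stored as nested lists and a binary instance mask. The function name `naive_composite` and the toy data are illustrative and do not come from the paper.

```python
def naive_composite(original, stylized, mask):
    """Copy stylized pixels where mask is 1 and keep original pixels elsewhere."""
    return [
        [s if m else o for o, s, m in zip(o_row, s_row, m_row)]
        for o_row, s_row, m_row in zip(original, stylized, mask)
    ]

# Toy 2x3 example: the mask selects the pixels that belong to the object.
original = [[10, 10, 10], [10, 10, 10]]
stylized = [[90, 80, 70], [60, 50, 40]]
mask = [[0, 1, 1], [0, 0, 1]]
result = naive_composite(original, stylized, mask)
```

Because every pixel is copied from exactly one image, any error in the mask shows up as a hard seam, which is the artifact the MRF step is meant to reduce.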

2.1 Style Transfer

Our algorithm is built on the style transfer algorithm of Gatys et al. [1]. Given a source image $s$ containing a prescribed style, and a target image $t$, the algorithm recovers an image $x$ with deep features similar to the target image, but with texture information taken from $s$. This is accomplished by solving the (highly) non-linear least-squares problem

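$$\min_{x}\ \underbrace{\|F(x)-F(t)\|^{2}}_{\text{structure}} + \underbrace{\|C(x)-C(s)\|^{2}}_{\text{style}}$$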

where $F(\cdot)$ is a function mapping an image onto its deep features, and $C(\cdot)$ maps an image onto a covariance matrix that measures the correlations of deep features in space. For detailed construction of these operators, see [1]. This problem is solved using back-propagation on $x$ as implemented in the popular deep learning library Torch.
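To make the two terms concrete, the sketch below evaluates the objective on precomputed feature maps. It rests on several assumptions that are not in the paper: features are given as a list of channels, each flattened over spatial positions; both terms use the same layer; and $C(\cdot)$ is approximated by an uncentered Gram matrix normalized by the number of positions. The names `gram`, `sq_norm_diff`, and `objective` are hypothetical.

```python
def gram(features):
    """Uncentered channel-by-channel correlation (Gram) matrix.

    features[i] holds channel i flattened over all spatial positions.
    """
    n = len(features[0])
    return [[sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
            for fi in features]

def sq_norm_diff(a, b):
    """Squared Frobenius norm of the difference of two nested lists."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def objective(fx, ft, fs):
    """Structure term plus style term for features of x, t and s."""
    structure = sq_norm_diff(fx, ft)            # ||F(x) - F(t)||^2
    style = sq_norm_diff(gram(fx), gram(fs))    # ||C(x) - C(s)||^2
    return structure + style

# Toy features: 2 channels, 2 spatial positions.
ft = [[1.0, 0.0], [0.0, 1.0]]   # features of the target t
fs = [[1.0, 1.0], [1.0, 1.0]]   # features of the style source s
value_at_target = objective(ft, ft, fs)
```

Starting from $x = t$ the structure term is zero and only the style term contributes. An optimizer would then trade the two terms off by updating $x$ through the feature network, which this sketch leaves out.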

2.2 Instance Segmentation

We use an instance-aware semantic segmentation method [11] to generate a mask for each object instance in an image. Our interface enables a user to simply click on a semantic instance, and the image style is transferred to that instance. The instance semantic segmentation approach is built on a cascaded multi-task network using the loss function:

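$$L(\theta) = L_{b}(B(\theta)) + L_{m}(M(\theta) \mid B(\theta)) + L_{c}(C(\theta) \mid M(\theta), B(\theta))$$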

where $\theta$ denotes the weight parameters of the neural network. There are three loss terms, where the later ones depend on the earlier ones. $L_{b}$ is the loss function of Region Proposal Networks (RPNs) introduced in [8], which generate bounding box locations and predict their “objectness” scores $B(\theta)$. $L_{m}$ is the loss of the second stage, where Region-of-Interest (RoI) pooling [9] is used to extract features in the predicted boxes and a binary logistic regression predicts the instance mask $M(\theta)$. Finally, as in [12], the softmax classification loss $L_{c}$ is computed on top of concatenated pathways of masks $M(\theta)$ and boxes $B(\theta)$, and the last stage outputs the class prediction scores $C(\theta)$ for all instances.

2.3 Using MRFs to Blend Images

To blend the targeted style-transferred object (which we call the foreground in this section) smoothly into the original image (the background), we use a Markov random field. We composite the stylized/foreground and original/background images by solving an optimization problem to choose among possible labels (either foreground or background) for each pixel in the image. The properties of an ideal blending are:

The boundary between stylized and original pixels should be near the original segmentation boundary.
The seams should not draw attention, i.e., the stylized object should blend smoothly into the background.

We formulate an objective function that approximately measures these properties, and then minimize it using Markov Random Field (MRF) optimization to assign a foreground/background label to each pixel. We first define a narrow band of ambiguous pixels near the foreground/background object boundary. Points outside of this ambiguous band have their labels fixed to the value assigned during the original semantic segmentation. Only points within this band will be adjusted to achieve a smooth effect.
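The text does not say how the ambiguous band is constructed. One plausible construction, shown here only as an assumption, marks every pixel whose square neighbourhood of a chosen radius contains both labels. The function name `ambiguous_band` and the `radius` parameter are hypothetical.

```python
def ambiguous_band(mask, radius=1):
    """Mark pixels whose (2*radius+1)-square neighbourhood contains both labels."""
    h, w = len(mask), len(mask[0])
    band = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                    if mask[ny][nx] != mask[y][x]:
                        band[y][x] = True
    return band

# One image row: background (0) on the left, foreground (1) on the right.
mask = [[0, 0, 0, 1, 1, 1]]
band = ambiguous_band(mask, radius=1)
```

Pixels outside the band keep their segmentation label, so a larger `radius` gives the optimization more freedom to move the seam at the cost of more variables.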

For an ambiguous pixel with image coordinates $p=(p_{x},p_{y}),$ and label $l$ which can be either foreground or background, we define the unary potentials $U(p,l)$ as:

$$U(p,l) = \|p - c^{l}\|.$$

where $c^{l}=(c^{l}_{x},c^{l}_{y})$ is the closest non-ambiguous pixel to $p$ in region $l$. This encourages the model to select a background label for pixels lying near the boundary between background and ambiguous pixels, and a foreground label for pixels lying near the foreground side of the boundary.
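As a small worked example of the unary potential, the sketch below computes the Euclidean distance from an ambiguous pixel to the nearest fixed pixel of each label by brute force. The dictionary layout `fixed` and the function name `unary` are illustrative assumptions; a real implementation would more likely use a distance transform.

```python
import math

def unary(p, label, fixed):
    """U(p, l): distance from ambiguous pixel p to the closest fixed pixel with this label."""
    return min(math.hypot(p[0] - c[0], p[1] - c[1]) for c in fixed[label])

# Coordinates of non-ambiguous pixels on either side of the band (toy data).
fixed = {
    "background": [(0, 5), (1, 5)],
    "foreground": [(5, 5), (6, 5)],
}
p = (2, 5)  # an ambiguous pixel next to the background side
u_bg = unary(p, "background", fixed)
u_fg = unary(p, "foreground", fixed)
```

Here the background label is cheaper for `p` because the nearest fixed background pixel is one step away while the nearest fixed foreground pixel is three steps away.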

The binary potential term in the MRF encourages smooth transitions between the foreground and background. Let $I_{l}(p)$ denote the intensity of the foreground image at pixel $p$ when $l$ is the foreground label, and the background intensity when $l$ is the background label. Given two pixels, $p_{1}$ and $p_{2}$, with respective labels $l_{1}$ and $l_{2}$, we define the pairwise energy

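$$B(p_{1},l_{1},p_{2},l_{2}) = |I_{l_{1}}(p_{1}) - I_{l_{2}}(p_{1})|^{2} + |I_{l_{2}}(p_{2}) - I_{l_{1}}(p_{2})|^{2}.$$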

This energy forces the transition between foreground and background to occur near pixels that are least affected by the stylization. Finally, we define the following energy function:

$$E(l) = \sum_{p} U(p,l_{p}) + \sum_{\{p,q\}\in\mathcal{N}} B(p,l_{p},q,l_{q}).$$

where $\mathcal{N}$ contains sets of neighboring pixels. We obtain the labels $l$ by minimizing the energy $E(l)$ using gco-v3 [13].
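The interplay between the unary and pairwise terms can be seen on a toy problem. The sketch below is not the graph-cut solver gco-v3 used in the paper: it replaces it with exhaustive search over a single row of six pixels, which is only feasible at toy sizes. Labels are 0 for background and 1 for foreground, neighbours are adjacent pixels, and the function names and data are hypothetical.

```python
from itertools import product

def pairwise(p1, l1, p2, l2, images):
    """B(p1, l1, p2, l2): squared disagreement of the two layers at both pixels."""
    a, b = images[l1], images[l2]
    return (a[p1] - b[p1]) ** 2 + (b[p2] - a[p2]) ** 2

def energy(labels, images, fixed):
    """E(l) for a 1-D row of pixels with neighbour pairs {p, p+1}."""
    unary = 0.0
    for p, l in enumerate(labels):
        if p not in fixed:  # only ambiguous pixels carry a unary cost
            unary += min(abs(p - c) for c, lc in fixed.items() if lc == l)
    smooth = sum(pairwise(p, labels[p], p + 1, labels[p + 1], images)
                 for p in range(len(labels) - 1))
    return unary + smooth

def minimize(images, fixed, n):
    """Try every labeling of the ambiguous pixels and keep the cheapest one."""
    free = [p for p in range(n) if p not in fixed]
    best = None
    for choice in product((0, 1), repeat=len(free)):
        labels = [fixed.get(p) for p in range(n)]
        for p, l in zip(free, choice):
            labels[p] = l
        e = energy(labels, images, fixed)
        if best is None or e < best[0]:
            best = (e, labels)
    return best

background = [0, 0, 0, 0, 0, 0]   # original image intensities
foreground = [0, 0, 0, 3, 3, 3]   # stylized image intensities
images = [background, foreground]
fixed = {0: 0, 1: 0, 4: 1, 5: 1}  # pixels 2 and 3 form the ambiguous band
best_energy, best_labels = minimize(images, fixed, n=6)
```

In this example the cheapest labeling assigns both ambiguous pixels to the foreground, placing the seam between pixels 1 and 2, where the stylized and original rows agree. Cutting between pixels 2 and 3 or between 3 and 4 would have a lower or equal unary cost but a much higher pairwise cost.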

3 Experimental Results

Since there is no prior work on style transfer using segmentation masks, we show results for a simple mask transfer scheme and for MRF-based blending. As mentioned before, the mask is generated by our deep instance-aware segmentation algorithm. In Fig. 3, the second column shows the input image. The third column shows a simple mask transfer scheme, in which the style computed on the whole image is overlaid onto the mask. The last column shows the results after jointly blending the input image with the stylized image. In the car image (at the pixels near the top), it can be clearly seen that MRF-based blending improves object-level semantic style transfer. It improves results when the colors are not consistent between the input image and the stylized image at the semantic boundaries. This is especially noticeable near the bird’s tail.

4 Conclusion

We presented a method for transferring artistic style onto a single object within an image. The proposed method combines style transfer with semantic segmentation, and blends the results together using an MRF. We presented results of the method and demonstrated the improvement in object boundaries afforded by the MRF model. Future work will focus on ways to integrate the stages of the algorithm so that a single deep network can perform them in one shot.

Bibliography

[1] Leon A Gatys, Alexander S Ecker, and Matthias Bethge, “Image style transfer using convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2414–2423.
[2] Leon A Gatys, Alexander S Ecker, and Matthias Bethge, “A neural algorithm of artistic style,” arXiv preprint arXiv:1508.06576, 2015.
[3] Leon Gatys, Alexander S Ecker, and Matthias Bethge, “Texture synthesis using convolutional neural networks,” in Advances in Neural Information Processing Systems, 2015, pp. 262–270.
[4] Leon A Gatys, Matthias Bethge, Aaron Hertzmann, and Eli Shechtman, “Preserving color in neural artistic style transfer,” arXiv preprint arXiv:1606.05897, 2016.
[5] Justin Johnson, Alexandre Alahi, and Li Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” arXiv preprint arXiv:1603.08155, 2016.
[6] Chuan Li and Michael Wand, “Combining Markov random fields and convolutional neural networks for image synthesis,” arXiv preprint arXiv:1601.04589, 2016.
[7] Ross Girshick, “Fast R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[8] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, 2015, pp. 91–99.