S-Flow GAN

Yakov Miron; Yona Coscas

arXiv:1905.08474·cs.CV·September 26, 2019

TL;DR

This paper introduces S-Flow GAN, a conditional GAN architecture that translates semantic label maps and CG edge maps into photo-realistic images, with extensions for video generation, enhancing realism and temporal coherence.

Contribution

The paper presents a novel GAN architecture for domain translation from semantic and edge maps to realistic images, including a new video extension for temporal coherence.

Findings

01. Effective translation from semantic maps to realistic images.
02. Enhanced photo-realism in generated images.
03. Temporal coherence in video generation achieved.

Abstract

Our work offers a new method for domain translation from semantic label maps and Computer Graphic (CG) simulation edge map images to photo-realistic images. We train a Generative Adversarial Network (GAN) in a conditional way to generate a photo-realistic version of a given CG scene. Existing GAN architectures still lack the photo-realism needed to train DNNs for computer vision tasks; we address this issue by embedding edge maps and training the model in an adversarial mode. We also offer an extension to our model that uses our GAN architecture to create visually appealing and temporally coherent videos.

Tables

Table 1: Semantic segmentation results on the Cityscapes [8] validation set.

Cityscapes	Pix2pix	Pix2pixHD	Ours	Oracle
Pixel accuracy (fraction)	0.7279	0.81	0.83	0.86
Mean IoU (fraction)	0.5324	0.67	0.69	0.701

Table 2: Semantic segmentation results on the Synthia [26] dataset.

Synthia	Pix2pix	Pix2pixHD	Ours	Oracle
Pixel accuracy (fraction)	0.54	0.79944	0.860753	0.913132
Mean IoU (fraction)	0.36	0.55955	0.740040	0.8419

Table 3: FID and FVD metric comparison between pix2pix, pix2pixHD, vid2vid and Ours (lower is better).

Metric	Pix2pix	Pix2pixHD	Vid2vid	Ours-img	Ours-vid
FID	116.69	71.21	154.36	69.25	69.81
FVD	-	-	0.706	-	0.326

Equations

$$\min_{G}\max_{D}L_{GAN}(D,G)$$

$$L_{GAN}(D,G)=\mathbb{E}_{(x,s)}[\log D(s,x)]+\mathbb{E}_{s\sim p_{data}(s)}[\log(1-D(s,G(s)))]$$

$$L_{GAN}(D,G,e)=\mathbb{E}_{(x,s)}[\log D(s,x)]+\mathbb{E}_{(s,e)\sim p_{data}(s,e)}[\log(1-D(s,G(s,e)))]$$

$$L_{DNED}:=L_{DNED}(E(x))=\sum_{i=1}^{N}a_{i}\cdot BCE(d_{i}(x),E(x))$$

$$L_{FM_{m}}^{k}:=L_{FM_{m}}^{k}(D_{k},G,e)=\sum_{i=1}^{T}\frac{1}{N_{i}}\mathbb{E}_{(x,s,e)\sim p_{data}(s,x,e)}L_{1}(D_{k}^{i}(s,x)-D_{k}^{i}(s,G(s,e)))$$

$$L_{percep}:=L_{percep}(x,G(s,e))=\frac{1}{P}\sum_{i=1}^{P}L_{1}(FL_{VGG_{i}}(x)-FL_{VGG_{i}}(G(s,e)))$$

$$L_{CG2real}=\min_{G}\max_{D_{k},\,k=1:l_{m}}\sum_{k=1}^{l_{m}}L_{GAN}(D_{k},G,e)+\lambda_{1}\sum_{k=1}^{l_{m}}L_{FM_{m}}^{k}+\lambda_{2}L_{percep}+\lambda_{3}L_{DNED}$$

$$L_{flow}=L_{1}(\mathcal{F}_{real},\mathcal{F}_{fake})$$

$$L_{videogen}=L_{flow}+L_{CG2real}$$

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · Digital Media Forensic Detection

Methods: Convolution

Full text

S-Flow GAN

Yakov Miron, Yona Coscas

Elbit Systems Aerospace

{yakov.miron, yona.coscas}@elbitsystems.com

1 Introduction

Image-to-image translation, and more generally video-to-video translation, is of major importance for training autonomous systems. Training an autonomous agent in real environments is beneficial but not practical, since enough data cannot be gathered [7]. Simulated scenes, however, may lack detail: a synthetic image is not photo-realistic and lacks the variability and randomness of real images, so training succeeds only up to a certain point. This gap is also referred to as “the reality gap” [7]. By combining a non-photo-realistic simulated model with an available dataset, we can generate diverse scenes containing numerous types of objects, lighting conditions, colorization, etc. [6].

In this paper, we describe a new approach to generating images from a semantic label map and a flexible Deep Convolutional Neural Network (DCNN), which we call the Deep Neural Edge Detector (DNED), that embeds edge maps. We combine the embedded edge maps, which act as a skeleton, with a semantic map as input to our model (fig 2). The model outputs a photo-realistic version of that scene. Using the skeleton by itself would generate images that lack variability, as it restricts the representation to that specific skeleton. Instead, we learn to represent skeletons with a neural network, and at test time we sample the closest appropriate skeleton the network has seen during training. Moreover, we extend this idea to generate photo-realistic videos (i.e. sequences of images) with a novel loss that uses optical flow to enforce pixel coherency between consecutive images.

Recent works in the field of image generation include pix2pix [18], which generates images from semantic maps; cascaded refinement networks (CRN) [6], which refine different resolutions in a cascade; and pix2pixHD [37], which generates HD images conditionally using a multi-scale discriminator and a dual generator that acts as a super-resolution generator. The L1 loss is known to produce low-quality images, as the generated images are blurred and lack detail [10]. Instead, [11], [19] use a modified version of the perceptual loss, allowing generation of finer details in an image. Pix2pixHD [37] and CRN [6] also train their networks with a perceptual loss computed from a pre-trained network, e.g. VGGnet [31]. Moreover, pix2pixHD uses instance maps as well as label maps to enable the generator to separate several objects of the same semantic class. This is of high importance when synthesizing images with many instances of the same class in a single frame.

As for video generation, the losses used by [36], [30] tend to be computationally expensive, while our approach is simpler. We use two generators of the same architecture, trained jointly with our new optical-flow-based loss, which is fed by dense optical flow estimation. Our evaluation metrics are FID [15] and FVD [34], as they are commonly used for image and video generation schemes. We call this work S-Flow GAN since we embed Spatial information obtained from dense optical flow in a neural network as a prior for image generation and flow maps for video coherency. This optical flow is available since the simulated image is accessible at test time in the CG2real scheme.

We make three major contributions. First, our model can generate visually appealing photo-realistic images with high-definition details from semantic maps. Second, we incorporate a neural network to embed edge maps, thus allowing generation of diverse versions of the same scene. Third, we offer a new loss function for generating natural-looking videos using the above-mentioned image generation scheme. Please refer to this link for videos and a comparison to related work.

2 Related Work

2.1 Generative Adversarial Networks

Generative Adversarial Networks (GANs) were introduced in 2014 [13]. This method generates images that look authentic to human observers. It does so with two neural networks, one generating candidates while the other acts as a critic and tries to evaluate the generation quality [2],[25],[42],[43],[28]. GANs are widely used for image generation; some image synthesis schemes generate low-resolution images, e.g. 32x32 [18], while [4] were able to generate higher-resolution images (up to 512x512). In addition, [37] were able to generate even higher-resolution images using coarse-to-fine generators. Generating high-resolution images is challenging because of the high dimensionality of the task and the need to provide cues for high resolution [24], [20]. We offer such cues as edge map skeletons generated by our proposed DNED module. During training, the DNED learns the representations of real image edge maps. At test time, the DNED is shown a CG (Computer Graphics) edge map, finds its best representation, and provides the generator with an appropriate generated edge map sampled from the distribution of real image edge maps.

2.2 Image synthesis

2.2.1 Image to image translation

In the pix2pix setting, a conditional GAN [23] is used: the network’s input is a semantic map of the scene, and during adversarial training a fake version of the real image is given to the discriminator to distinguish. In the CG2real setting, in addition to the semantic map, we also have access to the simulated image. Using the CG image as is might be counterproductive, since the network would be trained to reconstruct CG images rather than photo-realistic ones. On the other hand, some of the underlying CG information correlates with the real world and can provide a meaningful prior for the synthesis. Since the relevant information lies in the image's high frequencies [5], we learn the distribution of edge maps in real images (high-resolution details) and provide a representation of it to the generator at test time. Some image generation tasks use label maps only, e.g. [18]. Label maps provide information only about the class of a given pixel. In order to generate photo-realistic images, some works use instance maps as well [37]; this way, they can differentiate several adjacent objects of the same class. Nonetheless, while most datasets provide object-level information about classes like cars, pedestrians, etc., they do not provide that information about vegetation and buildings. As a result, the generated images might not correctly separate those adjacent objects, thus degrading photo-realism.

2.2.2 Learning edges by a neural network

Generating edge maps using neural networks is a well established method. Holistically-Nested Edge Detection (HED) provides holistic image training and prediction for multi-scale and multi-level feature learning [39]. They use a composition of generated edge maps to learn a fine description of the edge scene. Inspired by their work, we train a neural network to learn edge maps of real images.

As mentioned before, our generator requires an edge map as input. We obtain the edge map using a spatial Laplacian operator with a threshold. Providing the generator with a deterministic edge map would always produce the same scene, so we train the DNED to take that deterministic edge map as input, learn its representation, and produce a variant of that edge map as a superposition of edges seen in real datasets. This way the generator is able to produce a variety of photo-realistic images for the same scene.

Since our approach (using edge maps) is not class dependent, we do not need instance map information to generate several adjacent instances of the same semantic class. Moreover, this approach addresses the problem of generating fine details within a class like buildings and vegetation, as can be seen in fig 5.

2.3 Video to video synthesis

Generating temporally coherent image sequences is a known challenge. Recent works use GANs to generate videos in an unconditional setting [27],[33],[35] by sampling from a random vector, but they do not provide the generator with temporal constraints, and thus generate non-coherent sequences of images. Other works, like video matting [3] and video inpainting [38], translate videos to videos but rely on problem-specific constraints and designs. A recent work named vid2vid [36] conditionally generates video from video and is considered to be one of the best approaches to date. Using FlowNet 2.0 [17], they predict the optical flow of the next image. In addition, they use a mask to differentiate between two parts: the hallucinated image generated from instance-level semantic segmentation masks, and the predicted image from the previous frame. By adding these two parts, this method can combine the predicted details from the previously generated image with the details from the newly generated image. Inspired by [36], we use flow maps of consecutive images to generate temporally coherent videos. Contrary to [36], we do not use a CNN to predict the flow maps, nor a sequence generator, but a classical computer vision approach, because a pre-trained network (trained on real datasets) failed to generalize to simulated datasets, e.g. Synthia. This enables better temporal coherency and improves the robustness of video generation.

3 Model

Our CG2real model aims to learn the conditional distribution of an image given a semantic map. Our video generation model aims to use this learned distribution to generate temporally coherent videos from the images produced by the CG2real scheme. We first describe the image generation scheme, then review our video generation model.

3.1 Image generation

We use a conditional GAN to generate images from semantic maps as in [18]. To generate images, the generator receives the semantic segmentation images $s_{i}$ and maps them to photo-realistic images $x_{i}$. In parallel, the discriminator takes two images, the real image $x_{i}$ (ground truth) and the generated image ${f_{i}}$, and learns to distinguish between them. This supervised learning scheme is trained in the well-known min-max game [13],[28]:

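$$\min_{G}\max_{D}L_{GAN}(D,G)$$

$$L_{GAN}(D,G)=\mathbb{E}_{(x,s)}[\log D(s,x)]+\mathbb{E}_{s\sim p_{data}(s)}[\log(1-D(s,G(s)))]$$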
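As a minimal sketch of the min-max game described above, the value function $\mathbb{E}[\log D(s,x)]+\mathbb{E}[\log(1-D(s,G(s)))]$ can be estimated from batches of discriminator outputs. The function name `cgan_value`, the `eps` guard and the probability values are assumptions made for illustration; they are not taken from the paper.

```python
import math

def cgan_value(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of E[log D(s, x)] + E[log(1 - D(s, G(s)))].

    d_real: discriminator probabilities D(s, x) on real pairs.
    d_fake: discriminator probabilities D(s, G(s)) on generated pairs.
    eps guards against log(0); it is an implementation assumption.
    """
    real_term = sum(math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# Hypothetical discriminator outputs, for illustration only.
confident = cgan_value([0.9, 0.8], [0.1, 0.2])  # D separates real from fake
fooled = cgan_value([0.9, 0.8], [0.6, 0.7])     # G fools D more often
```

The discriminator tries to increase this value while the generator tries to decrease it, so `confident` is larger than `fooled` in this toy case.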

3.2 Embedding edge maps

In order to generate photo-realistic visually appealing images containing fine details, we provide a learnt representation of an edge map to the generator (fig 2), allowing it to learn the conditional distribution of real images given semantic maps and edge maps, i.e.:

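$$L_{GAN}(D,G,e)=\mathbb{E}_{(x,s)}[\log D(s,x)]+\mathbb{E}_{(s,e)\sim p_{data}(s,e)}[\log(1-D(s,G(s,e)))]$$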

During training, given an example image ${x_{i}}$, we estimate its edge map with the well-known spatial Laplacian operator [14],[9]. This edge map is concatenated to the semantic label map, and both are given as priors to the generator for adversarial training of the fake image ${f_{i}}$ vs. the real image ${x_{i}}$. To allow stable training, we begin training our GAN with the edge maps from the Laplacian operator. After the generator and discriminator stabilize, we provide our generator with edge maps from the DNED. We then jointly train the GAN with the DNED.
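The edge prior described above can be sketched as follows. This is an illustration under stated assumptions: the paper does not specify the Laplacian kernel or the threshold, so the 4-neighbour kernel, the threshold value and the helper names `laplacian_edge_map` and `stack_priors` are hypothetical.

```python
def laplacian_edge_map(image, threshold):
    """Binary edge map from a 4-neighbour Laplacian; border pixels stay 0."""
    h, w = len(image), len(image[0])
    edges = [[0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            lap = (image[r - 1][c] + image[r + 1][c]
                   + image[r][c - 1] + image[r][c + 1]
                   - 4 * image[r][c])
            edges[r][c] = 1 if abs(lap) > threshold else 0
    return edges

def stack_priors(semantic, edges):
    """Per-pixel (label, edge) pairs, i.e. a channel-wise concatenation."""
    return [[(s, e) for s, e in zip(srow, erow)]
            for srow, erow in zip(semantic, edges)]

# Toy 5x5 grayscale image with a vertical step between columns 1 and 2.
image = [[0, 0, 10, 10, 10] for _ in range(5)]
semantic = [[0, 0, 1, 1, 1] for _ in range(5)]
edges = laplacian_edge_map(image, threshold=5)
priors = stack_priors(semantic, edges)
```

In this toy image the Laplacian responds on both sides of the step, so the interior rows of `edges` are marked at columns 1 and 2.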

The DNED architecture is a modified version of HED [39]. HED generates several versions of the edge map at different sizes, each with a different receptive field. The purpose is to create an ensemble of edge maps, each capturing a different level of detail in the image. When all of them are superimposed, the resulting edge map has coarse-to-fine levels of detail. By changing the weights of that ensemble, we can generate the desired variability in the generated edge map, thus allowing us to generate diverse versions of the output. To conclude, the loss function for training the DNED is:

$$L_{DNED}:=L_{DNED}(E(x))=\sum_{i=1}^{N}a_{i}\cdot BCE(d_{i}(x),E(x))$$

Where: $d_{i}(x)$ , $i=0:5$ is the $i^{th}$ side output of a single scale, $E(x)$ is the classic edge map generated by the spatial Laplacian operator, BCE is the binary cross entropy loss. $N=6$ in our case. $a_{i}$ is the contribution of the $i^{th}$ scale to the ensemble.
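A minimal sketch of this weighted ensemble loss, assuming the side outputs have already been resized to the resolution of the Laplacian edge map and flattened to lists. The function names, the uniform weights and the toy values are assumptions for illustration only.

```python
import math

def bce(pred, target, eps=1e-7):
    """Mean binary cross entropy between two flat lists of values in [0, 1]."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)

def dned_loss(side_outputs, edge_target, weights):
    """Weighted sum over side outputs: sum_i a_i * BCE(d_i(x), E(x))."""
    return sum(a * bce(d, edge_target) for a, d in zip(weights, side_outputs))

# Hypothetical values: N = 6 side outputs on a 4-pixel edge map.
edge_target = [1.0, 0.0, 0.0, 1.0]
side_outputs = [[0.9, 0.1, 0.2, 0.8]] * 5 + [[0.5, 0.5, 0.5, 0.5]]
weights = [1.0 / 6] * 6
loss = dned_loss(side_outputs, edge_target, weights)
```

Changing `weights` shifts the emphasis between coarse and fine side outputs, which is the knob the text describes for varying the generated edge map.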

Increasing the resolution of the image might be challenging for GAN training. In other methods the discriminator needs a large receptive field [18],[29],[31],[22], requiring a deeper network or larger convolution kernels. A deeper network is prone to overfitting and, in the case of GAN training, might cause training to be unstable. This challenge is usually addressed by the multi-scale approach [12],[9],[16],[20],[40]. Since the DNED embeds a learnt representation of skeletons, our architecture performs very well on higher-resolution images. Our original generated images were of size [512x256]. We have successfully trained our model to generate images of size [768x384], i.e. 1.5 times larger in each dimension, without changing the model while using a single discriminator (see 3).

We showed that generating high-quality images with a single discriminator is feasible and that training is stable. We provide a comparison of our method with a multi-scale discriminator in 4. The FM loss is computed with k=1 for the single-layer discriminator and k=3 for the multi-layer one:

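$$L_{FM_{m}}^{k}:=L_{FM_{m}}^{k}(D_{k},G,e)=\sum_{i=1}^{T}\frac{1}{N_{i}}\mathbb{E}_{(x,s,e)\sim p_{data}(s,x,e)}L_{1}(D_{k}^{i}(s,x)-D_{k}^{i}(s,G(s,e)))$$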
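The FM loss for one discriminator can be sketched as a sum of per-layer L1 distances between discriminator activations on the real and generated pairs. The sketch assumes that $N_{i}$ is the number of elements in layer $i$ (the convention used by pix2pixHD), so a per-layer mean supplies the $1/N_{i}$ factor; the function names and activation values are hypothetical.

```python
def l1_mean(a, b):
    """Mean absolute difference; the mean supplies the assumed 1/N_i factor."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def feature_matching_loss(real_feats, fake_feats):
    """Sum over discriminator layers i = 1..T of the per-layer mean L1 distance."""
    return sum(l1_mean(r, f) for r, f in zip(real_feats, fake_feats))

# Hypothetical flattened activations from T = 2 discriminator layers.
real_feats = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5]]
fake_feats = [[1.0, 2.0, 2.0, 2.0], [0.0, 1.0]]
fm = feature_matching_loss(real_feats, fake_feats)  # 0.75 + 0.5
```

With a multi-scale discriminator the same function would be applied to each discriminator $D_{k}$ and the results summed.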

In addition, following [10],[11],[19],[44], we use the perceptual loss, computed with a pre-trained VGGnet [21], for improved visual performance and to encourage the discriminator to distinguish real from fake samples.

$$L_{percep}:=L_{percep}(x,G(s,e))=\frac{1}{P}\sum_{i=1}^{P}L_{1}(FL_{VGG_{i}}(x)-FL_{VGG_{i}}(G(s,e)))$$

Where, P is the number of slices from a pre-trained VGG network and $FL_{VGG_{i}}$ are the features extracted by the VGG network from the $i^{th}$ layer of the real and generated images respectively. To conclude, our overall objective for generating photo-realistic, diverse images in the CG2real setting is to minimize $L_{CG2real}$ :

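$$L_{CG2real}=\min_{G}\max_{D_{k},\,k=1:l_{m}}\sum_{k=1}^{l_{m}}L_{GAN}(D_{k},G,e)+\lambda_{1}\sum_{k=1}^{l_{m}}L_{FM_{m}}^{k}+\lambda_{2}L_{percep}+\lambda_{3}L_{DNED}$$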

3.3 Video generation

Using pre trained CG2real networks, we generate two consecutive images, and then estimate two flow maps. The first flow map is between $x_{i},x_{i+1}$ , where $x_{i}$ and $x_{i+1}$ are two consecutive real images. The second flow map is between $G(s_{i},e_{i}),G(s_{i+1},e_{i+1})$ , where $G(s_{i},e_{i})$ and $G(s_{i+1},e_{i+1})$ are two consecutive generated (fake) images. Note that the generation of $G(s_{i},e_{i}),G(s_{i+1},e_{i+1})$ is done independently, meaning we apply our CG2real method twice, without any modifications. To conclude we enforce temporal coherency by using the following loss:

$$L_{flow}=L_{1}(\mathcal{F}_{real},\mathcal{F}_{fake})$$

Where $\mathcal{F}_{real}=\mathcal{F}(x_{i},x_{i+1})$, $\mathcal{F}_{fake}=\mathcal{F}(G(s_{i},e_{i}),G(s_{i+1},e_{i+1}))$ and $\mathcal{F}(*)$ is the optical flow operator. This formulation eliminates the need for a sequential generator as in [36]. It not only lets us use our image generation model twice, which adds more constraints to the video generation scheme, but also avoids the error accumulation that arises from the positive feedback of feeding a generated image back to the generator, as can be seen in figure 7 and in this video.

By adding $L_{flow}$ to the $L_{CG2real}$ loss, the network learns to generate $G(s_{i+1},e_{i+1})$ while taking the flow maps into account, thus generating temporally coherent images as depicted in figure 7.

$$L_{videogen}=L_{flow}+L_{CG2real}$$
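A minimal sketch of the flow term and the combined video objective, assuming the two dense flow fields have already been estimated and flattened to lists of per-pixel $(u, v)$ vectors. The mean reduction, the function names and the numbers are assumptions; the paper's optical flow operator $\mathcal{F}$ is not reimplemented here.

```python
def flow_l1(flow_real, flow_fake):
    """Mean absolute difference between two flow fields of (u, v) vectors."""
    total, count = 0.0, 0
    for (ur, vr), (uf, vf) in zip(flow_real, flow_fake):
        total += abs(ur - uf) + abs(vr - vf)
        count += 2
    return total / count

def video_loss(flow_real, flow_fake, cg2real_loss):
    """L_videogen = L_flow + L_CG2real, with L_CG2real passed in as a number."""
    return flow_l1(flow_real, flow_fake) + cg2real_loss

# Hypothetical per-pixel flow vectors (in pixels) for a 3-pixel frame pair.
flow_real = [(1.0, 0.0), (1.0, 0.0), (0.0, 0.5)]
flow_fake = [(1.0, 0.0), (0.5, 0.0), (0.0, 1.5)]
total = video_loss(flow_real, flow_fake, cg2real_loss=2.0)
```

The flow term is zero only when the generated frame pair moves exactly like the real pair, which is how it penalizes temporal incoherence.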

4 Results

Our goal is to generate photo-realistic images. Fig 6 shows examples from the CG2real image synthesis task, and fig 8 presents consecutive images from the video-to-video synthesis. We use the same evaluation methods as previous image-to-image works, e.g. pix2pix [18], pix2pixHD [37] and others. The evaluation process consists of performing semantic segmentation with a pre-trained semantic segmentation network [41] on synthesized images produced by our model, then calculating the semantic pixel accuracy and the mean intersection over union (mIoU) over the classes in the dataset. As shown in tables 1, 2, our network outperforms previous works. The ground-truth results (Oracle) are the pixel accuracy and mIoU obtained when performing the same semantic segmentation on the real images.
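The two segmentation metrics can be sketched on flattened label maps as follows. This is an illustrative implementation, not the evaluation code used in the paper; skipping classes that are absent from both maps is an assumption, and the label values are toy data.

```python
def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted label equals the ground truth."""
    correct = sum(1 for p, t in zip(pred, truth) if p == t)
    return correct / len(truth)

def mean_iou(pred, truth, num_classes):
    """Average of per-class intersection over union, skipping absent classes."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, truth) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious)

# Toy flattened label maps with 3 classes (values are illustrative only).
truth = [0, 0, 1, 1, 2, 2]
pred = [0, 0, 1, 2, 2, 2]
acc = pixel_accuracy(pred, truth)            # 5 of 6 pixels match
miou = mean_iou(pred, truth, num_classes=3)  # mean of 1, 1/2 and 2/3
```

Here `truth` plays the role of the semantic label map and `pred` the segmentation network's output on a synthesized image.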

Furthermore, to evaluate the image generation quality, we used another metric that measures the distance between datasets, called FID (Fréchet Inception Distance) [15],[1]. It is a very common metric for generative models, as it correlates well with the visual quality of generated samples [36]. FID calculates the distance between two multivariate Gaussians, $X_{r}\sim N(\mu_{r},\Sigma_{r})$ and $X_{g}\sim N(\mu_{g},\Sigma_{g})$, fitted to the 2048-dimensional activations of the Inception-v3 pool3 layer [32] for real and generated samples respectively, and $FID=\|\mu_{r}-\mu_{g}\|^{2}+Tr(\Sigma_{r}+\Sigma_{g}-2({\Sigma_{r}\Sigma_{g}})^{1/2})$ is the score for image distributions $X_{r}$ and $X_{g}$. A lower FID score is better, meaning higher similarity between real and generated samples.
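To make the formula concrete, here is a sketch of the special case where both covariance matrices are diagonal, so the matrix square root reduces to element-wise square roots. This simplification is an assumption made to keep the example dependency-free; a full FID needs the square root of the full matrix product. The function name and statistics are hypothetical.

```python
import math

def fid_diagonal(mu_r, var_r, mu_g, var_g):
    """FID between two Gaussians with diagonal covariances.

    For diagonal matrices the trace term reduces to
    sum_i (var_r[i] + var_g[i] - 2 * sqrt(var_r[i] * var_g[i])).
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    trace_term = sum(vr + vg - 2.0 * math.sqrt(vr * vg)
                     for vr, vg in zip(var_r, var_g))
    return mean_term + trace_term

# Hypothetical 2-D feature statistics (the paper uses 2048-D activations).
same = fid_diagonal([0.0, 1.0], [1.0, 4.0], [0.0, 1.0], [1.0, 4.0])
shifted = fid_diagonal([0.0, 1.0], [1.0, 4.0], [3.0, 1.0], [4.0, 4.0])
```

Identical statistics give a distance of zero, while shifting the mean or changing the variance increases it.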

As can be seen in tables 1, 2, pix2pixHD’s results are better than pix2pix’s for pixel accuracy and mIoU. Our results are better than pix2pixHD’s and almost meet the oracle’s results on both Synthia [26] and Cityscapes [8]. In table 3, we compare the FID scores of all four image generation models w.r.t. the Oracle. Ours-img (our image generation model) outperforms both pix2pix and pix2pixHD. Moreover, adding a temporal consistency constraint to the image generation process can degrade image quality. Vid2vid, which uses pix2pixHD as its image generation model, shows a substantial degradation in image quality (FID 71.21 to 154.36). Our video generation, which uses our CG2real model, had only a marginal effect on the FID score of Ours-vid (69.25 to 69.81, still outperforming pix2pixHD) and did not degrade the quality of the generated images (fig 8).

Our video generation evaluation metric is FVD (Fréchet Video Distance), proposed by [34]. FVD is a metric for evaluating video generation models and uses a modified version of FID. We calculated the FVD score for our generated video (Ours-vid) w.r.t. the Oracle (real video) and did the same for vid2vid w.r.t. the same Oracle. Our FVD score on the video test set is 0.326 while vid2vid’s is 0.706, i.e. our distance to the oracle is less than half of vid2vid’s (0.706 / 0.326 ≈ 2.17). We suggest that this substantial margin stems from the errors accumulated in the video generation model of vid2vid (fig 8). As mentioned, our video generation model uses our flow loss and therefore does not encounter this phenomenon.

5 Summary

We present a CG2real conditional image generation method as well as a conditional video synthesis method. We propose a network (DNED) that learns the distribution of edge maps from real images and integrate it into a generator. We were able to generate highly detailed and diverse images, thus enabling better photo-realism. Using the DNED enables generating diverse yet photo-realistic realizations of the same desired scene without using instance maps. As for video generation, we offer a new scheme that utilizes flow maps, allowing better temporal coherence in videos. We compared our model to recent works and found that it outperforms them quantitatively and, more importantly, generates appealing images. Furthermore, our video generation model generates temporally coherent and consistent videos.

Bibliography

[1] Jonas Adler and Sebastian Lunz. Banach wasserstein gan. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems 31, pp. 6754–6763. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7909-banach-wasserstein-gan.pdf.
[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
[3] Xue Bai, Jue Wang, David Simons, and Guillermo Sapiro. Video snapcut: Robust video object cutout using localized classifiers. In ACM SIGGRAPH 2009 Papers, SIGGRAPH ’09, pp. 70:1–70:11, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-726-4. doi: 10.1145/1576246.1531376. URL http://doi.acm.org/10.1145/1576246.1531376.
[4] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[5] Peter Burt and Edward Adelson. The laplacian pyramid as a compact image code. IEEE Transactions on communications, 31(4):532–540, 1983.
[6] Qifeng Chen and Vladlen Koltun. Photographic image synthesis with cascaded refinement networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1511–1520, 2017.
[7] Jack Collins, David Howard, and Jürgen Leitner. Quantifying the reality gap in robotic manipulation tasks. arXiv preprint arXiv:1811.01484, 2018.
[8] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.