ACE: Zero-Shot Image to Image Translation via Pretrained Auto-Contrastive-Encoder
Sihan Xu, Zelong Jiang, Ruisi Liu, Kaikai Yang, Zhijie Huang

TL;DR
This paper introduces ACE, a contrastive learning framework enabling zero-shot image-to-image translation without domain-specific training, achieving competitive results and effective style feature learning.
Contribution
ACE is the first to enable zero-shot image translation using contrastive learning, learning style features across domains without task-specific training.
Findings
Achieves competitive multimodal translation results
Enables zero-shot translation without training on translation tasks
Effective style feature learning across domains
Sihan XU∗
University of Michigan
Ann Arbor
Zelong Jiang∗
University of Michigan
Ann Arbor
Ruisi Liu∗
University of Illinois
Urbana-Champaign
Kaikai Yang
YanShan University
Qinhuangdao, China
Zhijie Huang
ShanghaiTech University
Shanghai, China
Abstract
Image-to-image translation is a fundamental task in computer vision. It transforms images from one domain to images in another domain so that they have particular domain-specific characteristics. Most prior works train a generative model to learn the mapping from a source domain to a target domain. However, learning such mapping between domains is challenging because data from different domains can be highly unbalanced in terms of both quality and quantity. To address this problem, we propose a new approach to extract image features by learning the similarities and differences of samples within the same data distribution via a novel contrastive learning framework, which we call Auto-Contrastive-Encoder (ACE). ACE learns the content code as the similarity between samples with the same content information and different style perturbations. The design of ACE enables us to achieve zero-shot image-to-image translation with no training on image translation tasks for the first time.
Moreover, our learning method can effectively learn the style features of images from different domains. Consequently, our model also achieves competitive results on multimodal image translation tasks with zero-shot learning. Additionally, we demonstrate the potential of our method in transfer learning. With fine-tuning, the quality of translated images improves in unseen domains. Even though we use contrastive learning, all of our training can be performed on a single GPU with a batch size of 8. Our code is available at github.com/SihanXU/ACE.
1 Introduction
In the field of computer vision, image-to-image translation is well established and has achieved promising results on related tasks such as image colorization and style transfer. Existing works usually learn the mapping between the source and target domains [20, 40, 23, 18]. Newer training methods have emerged since [7, 30, 2, 24, 1, 31], but they are still trained to learn this mapping relationship, which keeps them from focusing on the distribution of samples. As a result, such works are inherently demanding about data distributions. For example, pix2pix requires the data of the two domains to come in pairs [20], while CycleGAN-like methods rely on the joint distribution of the two domains [40, 23, 18]. Some newer methods, such as [24, 7, 1], still have strict data requirements. Meanwhile, one-shot image translation (OST) [2] has achieved only limited breakthroughs with the same idea of learning the mapping. Most other works have focused on improving the quality of generated images [30, 31] and therefore still do not look beyond mapping relations to other approaches. In contrast, we believe that the problem of strict data requirements can be solved by a training method that achieves image translation without learning the mapping between different distributions.
In this paper, we propose a novel learning task (Section 3) that implements image translation by learning similar and different features of images within a data distribution (Fig. 1). We note that the features that are similar before and after image translation are precisely the features that need to be retained, while the features that differ within the same distribution are the features that need to be translated. In this way, our model can recognize the features to be retained or transformed by learning the similarities and differences within the distribution. Moreover, because such a training method does not learn the mapping relation, it can also translate samples in unseen domains without being trained on image translation tasks, thereby achieving zero-shot learning, as shown in Fig. 1(b) and Fig. 2(d). Although several previous works claim to attain zero-shot image translation [6, 22], they merely perform style transformation on features within a specific domain.
Based on these ideas, we propose the Auto-Contrastive-Encoder (ACE), an Auto-Encoder structure that incorporates contrastive learning (Section 3). Contrastive learning is effective at learning similarities between positive samples and differences between negative samples, which encourages our model to learn the similar and different features in the same distribution. In this paper, we use a structure similar to Simple Siamese Representation Learning (SimSiam) [5] to keep the model simple while allowing it to be trained with a small batch size (Section 4). For contrastive learning, we propose Adaptive Instance Augmentation and perform contrastive learning directly on the encoded image features. Experiments show that our method captures the similarities and differences in image features effectively.
It is worth noting that ACE is a framework applicable to any model rather than one specific model. Our model in this article uses the VGG [32] model for the encoder, while our decoder is a convolutional network (CNN) [29] using ResNet [16]. We perform zero-shot image translation and achieve satisfactory results on the SummerWinter [18], OrangeApple [8] and Animal Face [24] datasets. Furthermore, our experiment on Animal Face [24] shows that our method also has the potential for transfer learning by pre-training on large datasets. The experiments in this paper can all be completed on a single GPU with a batch size of 8.
2 Related Work
Image-to-image translation. Image-to-image translation, a prevalent problem in computer vision, aims to convert an input image into another output image. It has been applied to style transfer [21, 11, 12], image denoising [37, 34, 10], and colorization [38, 39, 26]. Many methods for image translation tasks have been proposed since [20]. However, these methods are largely limited to learning the mapping between images and rely on pairs of data in the datasets for training [40, 23, 18, 7, 24, 1, 30], and thus cannot realize zero-shot learning. Although earlier research [6, 22] addresses zero-shot learning, it only performs translation within the same domain rather than real image translation.
This paper presents a new method of image translation that learns the similarities and differences of samples within the same distribution. Our ACE approach demonstrates the feasibility of this idea. Our method not only achieves competitive results on zero-shot image-to-image translation but is also applicable to various image translation tasks, such as multimodal translation. Furthermore, our model has the potential for transfer learning, improving the quality of image translation through pre-training and fine-tuning.
Contrastive learning. Contrastive learning is an efficient method for unsupervised learning. Its key idea is to learn similar features between positive samples and different features between negative samples [36]. Several influential works have built on this idea, such as [15, 4, 14, 5, 3]. Some methods, such as [14, 5, 3], can learn similar features between positive samples even without negative samples.
We use a structure similar to SimSiam [5] in this paper, with the difference that we augment the features of images instead of the images themselves. We first use Adaptive Instance Augmentation to augment the features of the images and then apply contrastive learning. Unlike SimSiam [5], we also include the predictor when using the encoder. Experiments demonstrate that our method is effective in learning similarities and differences within a distribution.
3 Method
3.1 Assumption
Fig. 3 shows the fundamental assumption of our method. Based on the effect of in-domain contrast and cross-domain contrast, we can model the content information as the in-domain difference and cross-domain similarity, and the style information as the in-domain similarity and cross-domain difference. We then follow the assumption in MUNIT [18], illustrated in Fig. 4, that each image is generated from a content latent code that is shared by both domains and a style latent code that is specific to the individual domain. For each image, our objective is to find a pair of underlying encoders that disentangle the two latent codes, and a generator that reconstructs images from these two types of codes. Suppose we have a pair of images, one providing the content and one providing the style. We can generate a translated image by applying the content encoder to the first image, the style encoder to the second, and the generator to the resulting pair of codes. The translated image is then also a sample in the domain of the style image, with the same content as the content image. Since the content encoder generates domain-invariant representations, the content code of the translated image should ideally equal that of the content image. Similarly, since the style image and the translated image are from the same domain, the corresponding constraint on their style codes should also hold.
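The two consistency constraints in this assumption can be written compactly. The notation below is an assumption introduced here for illustration, since the original symbols are not available in this copy: $x_1$ is the content image, $x_2$ is the style image, $E_c$ and $E_s$ are the content and style encoders, and $G$ is the generator.

```latex
% Assumed notation: x_1 = content image, x_2 = style image,
% E_c = content encoder, E_s = style encoder, G = generator.
\begin{aligned}
x_{1\to 2} &= G\big(E_c(x_1),\, E_s(x_2)\big) && \text{(translated image)} \\
E_c(x_{1\to 2}) &= E_c(x_1) && \text{(content is domain-invariant)} \\
E_s(x_{1\to 2}) &= E_s(x_2) && \text{(style is shared within the domain)}
\end{aligned}
```

Read this way, the first line is the translation itself and the other two are the ideal conditions that the latent consistency losses of Section 3.4 encourage.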
3.2 Model
Fig. 6 gives an overview of our model, which is similar to MUNIT [18]. Our model consists of the encoder, content encoder, style encoder, and decoder. The content encoder contains residual blocks [16] and a predictor, forming a contrastive learning framework. During training (Fig. 6(a)), we first obtain the content and style features through the content encoder and style encoder, respectively. The image can then be restored by the decoder. During fine-tuning and inference (Fig. 6(b)), the encoder extracts features from the style and content images, which are then passed to the style encoder and content encoder, respectively. After these steps, we use the decoder to produce the desired image.
We use VGG [32] as the encoder of the model, as in previous works [17, 18, 24]. Fig. 6(c) presents the learning process of the content code. After acquiring the features of the input images, we augment the image features with Adaptive Instance Augmentation. Subsequently, we use a SimSiam-like structure [5] to implement contrastive learning and obtain the images' content features. Our style encoder is a single-layer CNN with Adaptive Pooling [25], which preserves the global feature information of the images. Following MUNIT [18], we use an MLP to learn the AdaIN parameters from the style codes.
The content encoder includes two parts, as shown in Fig. 6(c). The residual blocks are CNNs with skip connections [16] and BatchNorm [19]. The predictor is a 2-layer MLP with a bottleneck and BatchNorm between the hidden layers. We do not use BatchNorm at the output layer.
Our decoder is composed of residual networks and a convolutional network [29] with adaptive instance normalization [17]. As stated in [18], instance normalization [35] and batch normalization would destroy the style features of the image. Therefore, we exclude these two types of normalization in our decoder.
3.3 Adaptive Instance Augmentation
Experimental results with adaptive instance normalization (AdaIN) [17] show that instance normalization affects the style of images. Following this, [18] proposes to use an MLP to dynamically produce the parameters of the Instance Normalization layers from style codes. Inspired by these practices, we propose Adaptive Instance Augmentation, where we replace the parameters of AdaIN with Gaussian noise:
[TABLE]
Note that this is an augmentation in the latent space. The procedure to augment a sample is
[TABLE]
Here, instead of using the style encoder, the decoder uses the sampled Gaussian variables for its AdaIN layers. We first map the sample into its content code and then use the randomized AdaIN decoder to reconstruct the augmented sample. This method enables us to modify the style of images while keeping the image content the same. Based on this feature augmentation, our contrastive learning method can make the content encoder insensitive to style features, so that the content features are effectively preserved.
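The randomized AdaIN step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the scale is drawn from N(1, noise_std^2) and the shift from N(0, noise_std^2) (the source only says the AdaIN parameters are replaced with Gaussian noise), the feature map is a list of flattened channels, and the surrounding encoder and decoder are omitted. The function names are hypothetical.

```python
import random
import statistics


def adain(channel, gamma, beta, eps=1e-5):
    """Normalize one feature channel to zero mean and unit variance, then scale and shift it."""
    mean = statistics.fmean(channel)
    std = (statistics.pvariance(channel, mu=mean) + eps) ** 0.5
    return [gamma * (v - mean) / std + beta for v in channel]


def adaptive_instance_augmentation(features, rng, noise_std=1.0):
    """Apply AdaIN per channel with Gaussian (gamma, beta) instead of style-encoder outputs.

    Assumption: gamma ~ N(1, noise_std^2) and beta ~ N(0, noise_std^2).
    """
    augmented = []
    for channel in features:
        gamma = rng.gauss(1.0, noise_std)
        beta = rng.gauss(0.0, noise_std)
        augmented.append(adain(channel, gamma, beta))
    return augmented


rng = random.Random(0)
# Toy feature map: two channels with four spatial positions each (values are made up).
features = [[0.1, 0.4, 0.9, 0.2], [1.0, 3.0, 2.0, 0.0]]
view_a = adaptive_instance_augmentation(features, rng)
view_b = adaptive_instance_augmentation(features, rng)
```

The two views share the same underlying features but carry different per-channel statistics, which is the kind of style perturbation a contrastive objective can then be trained to ignore.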
3.4 Loss Function
Following our assumptions, we first design our loss function to capture the similarity between content codes with different style features. Consider two latent codes extracted by the content encoder and augmented with Adaptive Instance Augmentation. We define the loss for contrastive learning on these two codes as the SimSiam loss from [5]:
[TABLE]
Here the loss uses the predictor layer and a distance measure, which can be any distance measurement such as negative cosine similarity. Content consistency is also enforced by minimizing the distance between the content codes extracted from each pair of original sample and reconstructed sample:
[TABLE]
Similarly, for style codes we have
[TABLE]
Next, to train the auto-encoder, we adopt a reconstruction loss and a GAN loss to ensure that the reconstructed images follow the distribution of the target domain.
[TABLE]
We train our model with the total objective as the weighted sum of all loss functions mentioned above.
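As a rough illustration of the contrastive term and the weighted total objective, the sketch below implements the symmetric SimSiam-style loss with negative cosine similarity and a weighted sum of named loss terms. It is a sketch under assumptions: the function names, loss names, and weights are hypothetical, vectors are plain Python lists, and the stop-gradient of SimSiam is only noted in a comment because no automatic differentiation is involved here.

```python
import math


def negative_cosine(p, z):
    """Negative cosine similarity between a predictor output p and a target code z."""
    dot = sum(a * b for a, b in zip(p, z))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in z))
    return -dot / norm


def simsiam_loss(p1, p2, z1, z2):
    """Symmetric SimSiam-style loss.

    p1, p2 are predictor outputs for the two augmented codes; z1, z2 are the codes.
    In an autograd framework, z1 and z2 would be treated as constants (stop-gradient).
    """
    return 0.5 * negative_cosine(p1, z2) + 0.5 * negative_cosine(p2, z1)


def total_objective(losses, weights):
    """Weighted sum of the individual loss terms, keyed by name."""
    return sum(weights[name] * value for name, value in losses.items())


# Identical codes give the minimum value of -1 (up to floating-point error).
z = [1.0, 2.0, 3.0]
contrastive = simsiam_loss(z, z, z, z)

# Hypothetical loss values and weights, for illustration only.
losses = {"contrastive": contrastive, "content": 0.2, "style": 0.1, "recon": 0.5, "gan": 0.7}
weights = {"contrastive": 1.0, "content": 1.0, "style": 1.0, "recon": 10.0, "gan": 1.0}
total = total_objective(losses, weights)
```

Negative cosine similarity is bounded between -1 and 1, so the contrastive term cannot dominate the sum by growing without limit; the relative weights then decide how strongly reconstruction and adversarial terms shape training.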
3.5 Stop gradient
Since the Auto-Encoder updates the encoder during training, it would greatly influence the effect of contrastive learning. Therefore, when training the Auto-Encoder, we freeze the content encoder, which means
[TABLE]
3.6 Discriminator
We use an approach similar to the Generative Adversarial Network (GAN) [13] to train our ACE and improve the quality of the images. In our experiments, we use a loss function similar to [27] to make the training more stable, and we use SpectralNorm [28] to enable the model to generate images of higher quality.
4 Experiments
4.1 Implementation Details
Our framework comprises a VGG encoder, a content encoder, a style encoder, and a decoder. The content encoder consists of four residual blocks, and the style encoder contains a global pooling layer and a fully connected layer. The decoder has several residual blocks, each followed by up-sampling layers. We also use Adaptive Instance Normalization layers to dynamically generate the parameters of Instance Normalization. In addition, to accelerate the convergence of the style encoder, we use two different style codes to represent the global style of the domain and the individual style of each sample, respectively. The domain style code is a learnable tensor shared by all data in the pre-training domain, while the individual style code is output by the style encoder. We then sum these two style codes before applying them to the AdaIN layers.
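The combination of the two style codes can be pictured with a small sketch. Everything here is illustrative rather than taken from the released code: the function names are hypothetical, the code dimensions and values are made up, and a single linear layer stands in for the MLP that maps style codes to AdaIN parameters.

```python
def combine_style_codes(domain_code, individual_code):
    """Element-wise sum of the shared domain style code and a per-sample style code."""
    if len(domain_code) != len(individual_code):
        raise ValueError("style codes must have the same length")
    return [d + s for d, s in zip(domain_code, individual_code)]


def linear(weights, bias, x):
    """Single linear layer, used here as a stand-in for the MLP that produces AdaIN parameters."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


# Hypothetical values: one learnable code shared by the whole pre-training domain,
# and one code produced by the style encoder for a single image.
domain_code = [0.5, -0.2]
individual_code = [0.1, 0.3]

style = combine_style_codes(domain_code, individual_code)

# Map the summed style code to one (gamma, beta) pair for an AdaIN layer.
gamma, beta = linear([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0], style)
```

Because the domain code is shared, the style encoder only has to model how each sample deviates from the domain-wide style, which is one plausible reason the split speeds up convergence.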
4.2 Datasets
We conduct the evaluation on the same datasets as [18, 40]. Our method achieves satisfactory results on Yosemite summer↔winter, apple↔orange, and Animal Face translation (including data of bigcats, cats and dogs).
4.3 Visualization
To better understand whether our designed models work as we expect, we adopt some tools of visualization for our extracted content codes. Fig. 7 shows the information in the content codes from the pretrain domain and the unseen domain. Both representations indicate the animal’s eyes. It verifies that our contrastive learning based encoder is able to extract similar content codes regardless of the domains.
4.4 Effectiveness of Zero-shot Learning
Our experiment covers the zero-shot learning on datasets SummerWinter [18] and OrangeApple [8]. The final results are shown in Fig. 9.
We also conduct multimodal translation on the Animal Image Translation Dataset [24]. We train our model on the cat dataset and apply it to the bigcat2cat and dog2cat tasks. The images in Fig. 9 show that our method translates images to different styles while maintaining the original content features.
The focus of our work is not on producing high-quality images. For this reason, we only compare our method with OST [2], FUNIT [24], and MUNIT [18]. During training, our model shares the same settings as MUNIT, while OST and FUNIT keep the settings given in their original papers. The comparison is presented in Fig. 10, which shows that our framework obtains strong translation results in a zero-shot manner.
4.5 Transfer Learning
Here we discuss whether our model is suitable for transfer learning. We pre-train our model on a specific domain and test on one or two unseen domains. If training data from the test domains is available, we can fine-tune to improve the quality of the generated images. The fine-tuning process is quite similar to pre-training. We use the images in the source domain to generate the content codes and the images in the target domain to obtain the style codes. The fine-tuning loss consists of the latent consistency losses and the GAN loss. We do not use the contrastive loss in fine-tuning.
As shown in Fig. 11, in experiments on multimodal image translation, we pre-train our model on the cat domain, and it achieves satisfactory performance when applied to the task of dog2cat translation. Our model still performs well if we change the target domain to bigcats. Therefore, we believe that with pre-training, our model is capable of translating images from one unseen domain to another unseen domain. If we continue to fine-tune on the bigcats dataset, the quality of the generated images improves further. As a result, we believe our model has great potential for transfer learning based on large datasets like ImageNet [8].
4.6 Failure Cases
Our method fails in some cases where the context is complicated, as in Fig. 12. For instance, on the horse2zebra dataset, our model sometimes erroneously puts the zebra's stripes on the background.
5 Discussion
The method proposed in this paper has achieved satisfactory results on image-to-image translation, but we have not conducted experiments on other types of translation tasks. For example, we believe our method can work on language processing as well. Furthermore, due to limited resources, we only tested the potential of our method for transfer learning with small datasets. If we can pre-train on a larger dataset, our model may be able to achieve better results in image translation. In this article, we use a very simple model structure, but our approach is also applicable to other models, such as ResNet [16], Vision Transformer [9] and Diffusion Model [33]. We believe that better results can be achieved if our method is combined with these models.
6 Conclusions
To overcome the challenge of learning the mapping relationship in image-to-image translation, we propose a new objective: translating images by learning similarities and differences, without learning any mappings or joint distributions. Additionally, we propose a simple model structure called Auto-Contrastive-Encoder to solve this problem, and it achieves satisfactory results. We have also shown the potential of our model in transfer learning. We hope that our method can help move the task of image-to-image translation forward.
References
[1] Kyungjune Baek, Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Hyunjung Shim. Rethinking the truly unsupervised image-to-image translation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14154–14163, October 2021.
[2] Sagie Benaim and Lior Wolf. One-shot unsupervised cross domain translation, 2018.
[3] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9650–9660, October 2021.
[4] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations, 2020.
[5] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15750–15758, June 2021.
[6] Yuanqi Chen, Xiaoming Yu, Shan Liu, and Ge Li. Toward zero-shot unsupervised image-to-image translation, 2020.
[7] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
