Cross-modal Face- and Voice-style Transfer

Naoya Takahashi; Mayank K. Singh; Yuki Mitsufuji

arXiv:2302.13838·cs.CV·March 2, 2023

TL;DR

This paper introduces XFaVoT, a novel framework for cross-modal style transfer that jointly performs face and voice translation tasks, enabling the generation of matching face-voice pairs with improved quality and diversity.

Contribution

XFaVoT is the first unified model to perform cross-modal face and voice style transfer, effectively matching impressions across modalities and surpassing existing methods.

Findings

01

Outperforms baselines in quality and diversity

02

Achieves better face-voice correspondence

03

Effective on multiple datasets

Abstract

Image-to-image translation and voice conversion enable the generation of a new facial image and voice while maintaining some of the semantics such as a pose in an image and linguistic content in audio, respectively. They can aid in the content-creation process in many applications. However, as they are limited to the conversion within each modality, matching the impression of the generated face and voice remains an open question. We propose a cross-modal style transfer framework called XFaVoT that jointly learns four tasks: image translation and voice conversion tasks with audio or image guidance, which enables the generation of ``face that matches given voice" and ``voice that matches given face", and intra-modality translation tasks with a single framework. Experimental results on multiple datasets show that XFaVoT achieves cross-modal style translation of image and voice,…

Tables

Table 1: Combination of the style vector, cross-modality style consistency loss $\mathcal{L}_{csc}$, and trainable heads of the audio discriminator $D^{au}$. $\mathbf{a}^{av}$ and $\mathbf{x}^{av}$ denote audio and image from the audio-visual dataset, respectively.

Style vector	$\mathcal{L}_{csc}$	$D^{au}$
$E_{g}^{au}(\mathbf{a}^{av})$	✓	$g, y$
$E_{g}^{im}(\mathbf{x}^{av})$	✓	$g, y$
$E_{g}^{au}(\mathbf{a}^{au})$		$g, y$
$E_{g}^{im}(\mathbf{x}^{im})$		$g$

Table 2: Results of image translation on CelebA-HQ, GRID, and the combination of LRS3 and Lip2Wav. For audio-guided image translation on CelebA-HQ, we use audio from VCTK.

Guidance	Model	CelebA-HQ (+VCTK)		GRID		LRS3+Lip2Wav
Guidance	Model	FID [ $↓$ ]	LPIPS [ $↑$ ]	FID [ $↓$ ]	LPIPS [ $↑$ ]	FID [ $↓$ ]	LPIPS [ $↑$ ]
Image	StarGANv2 [11]	27.9	0.254	34.8	0.179	72.2	0.266
Image	XFaVoT (Ours)	20.6	0.169	48.0	0.181	60.4	0.248
Audio	SGSIM [43]	148.4	0.146	121.4	0.129	161.4	0.120
Audio	XFaVoT (Ours)	38.3	0.153	28.3	0.158	74.1	0.238

Table 3: Results of voice conversion on GRID and VCTK. ∗ indicates the average SpkSim calculated by comparing the output with 10 random utterances of the target speaker used for the reference image. SpkSim scores are not computable on VCTK+CelebA-HQ as there is no ground-truth audio for the images in CelebA-HQ.

Guidance	Model	GRID			VCTK (+CelebA-HQ)
Guidance	Model	NISQA [ $↑$ ]	SpkSim [ $↑$ ]	WER [ $↓$ ]	NISQA [ $↑$ ]	SpkSim [ $↑$ ]	WER [ $↓$ ]
-	Ground Truth	4.61	0.82	0.30	4.29	0.93	0.08
Audio	AdaIN-VC[13]	3.75	0.75	0.65	2.99	0.78	0.31
Audio	Fragment-VC [47]	3.49	0.70	0.66	3.00	0.71	0.27
Audio	XFaVoT (Ours)	4.65	0.75	0.46	4.56	0.83	0.16
Image	AdaIN-CVC [30, 13]	3.45	0.61^∗	0.54	2.63	-	0.21
Image	XFaVoT (Ours)	4.58	0.66^∗	0.48	4.51	-	0.19

Table 4: Ablation study on image- and audio-guided image translation tasks. Results on GRID and CelebA-HQ (+VCTK) are averaged.

Model	Image-guided		Audio-guided
Model	FID [ $↓$ ]	LPIPS [ $↑$ ]	FID [ $↓$ ]	LPIPS [ $↑$ ]
StarGANv2 [11]	54.9	0.200	-	-
$+$ Joint training	52.8	0.056	107.2	0.097
$+$ $ℒ_{c s c}$	49.3	0.048	53.7	0.049
$+$ Dual domain (=XFaVoT)	34.4	0.175	33.3	0.156
$-$ $ℒ_{c s c}$	47.0	0.057	38.5	0.036

Table 5: Ablation study on audio- and image-guided voice conversion tasks on GRID.

Model	Audio-guided			Image-guided
Model	NISQA [ $↑$ ]	SpkSim [ $↑$ ]	WER [ $↓$ ]	NISQA [ $↑$ ]	SpkSim [ $↑$ ]	WER [ $↓$ ]
StarGANv2VC [46]+one-shot	4.62	0.64	0.448	-	-	-
$+$ Joint training	4.61	0.64	0.428	4.64	0.58	0.44
$+$ $ℒ_{c s c}$	4.62	0.64	0.430	4.69	0.58	0.44
$+$ Dual domain ( $=$ XFaVoT)	4.65	0.75	0.455	4.58	0.66	0.48

Table 6: Image generator architecture.

Layer	Resample	Norm.	Output shape
Input	-	-	128 $\times$ 128 $\times$ 3
Conv 1 $\times$ 1	-	-	128 $\times$ 128 $\times$ 128
ResBlock	AvgPool	IN	64 $\times$ 64 $\times$ 256
ResBlock	AvgPool	IN	32 $\times$ 32 $\times$ 512
ResBlock	AvgPool	IN	16 $\times$ 16 $\times$ 512
ResBlock	AvgPool	IN	8 $\times$ 8 $\times$ 512
ResBlock	-	IN	8 $\times$ 8 $\times$ 512
ResBlock	-	IN	8 $\times$ 8 $\times$ 512
ResBlock	-	AdaIN	8 $\times$ 8 $\times$ 512
ResBlock	-	AdaIN	8 $\times$ 8 $\times$ 512
ResBlock	Upsample	AdaIN	16 $\times$ 16 $\times$ 512
ResBlock	Upsample	AdaIN	32 $\times$ 32 $\times$ 512
ResBlock	Upsample	AdaIN	64 $\times$ 64 $\times$ 256
ResBlock	Upsample	AdaIN	128 $\times$ 128 $\times$ 128
Conv 1 $\times$ 1	-	-	128 $\times$ 128 $\times$ 3

Table 7: Audio generator architecture.

Layer	Resample	Norm.	Output shape
Input	-	-	80 $\times$ 192 $\times$ 1
Conv 1 $\times$ 1	-	-	80 $\times$ 192 $\times$ 64
ResBlock	AvgPool	IN	40 $\times$ 96 $\times$ 128
ResBlock	AvgPool	IN	20 $\times$ 96 $\times$ 256
ResBlock	AvgPool	IN	10 $\times$ 48 $\times$ 512
ResBlock	AvgPool	IN	5 $\times$ 48 $\times$ 512
ResBlock	-	IN	5 $\times$ 48 $\times$ 512
ResBlock	-	IN	5 $\times$ 48 $\times$ 512
Concat.	-	-	5 $\times$ 48 $\times$ 640
ResBlock	-	AdaIN	5 $\times$ 48 $\times$ 640
ResBlock	-	AdaIN	5 $\times$ 48 $\times$ 640
ResBlock	Upsample	AdaIN	10 $\times$ 48 $\times$ 512
ResBlock	Upsample	AdaIN	20 $\times$ 96 $\times$ 256
ResBlock	Upsample	AdaIN	40 $\times$ 96 $\times$ 128
ResBlock	Upsample	AdaIN	80 $\times$ 192 $\times$ 64
Conv 1 $\times$ 1	-	-	80 $\times$ 192 $\times$ 1

Table 8: Image style encoder and discriminator architectures. $d$ and $k$ represent the output dimension and number of domains, respectively. We use $d=64,k=K_{g}$ for the style encoder and $d=1,k=K_{g}$ for the discriminator.

Layer	Resample	Norm.	Output shape
Input	-	-	128 $\times$ 128 $\times$ 3
Conv 1 $\times$ 1	-	-	128 $\times$ 128 $\times$ 128
ResBlock	AvgPool	IN	64 $\times$ 64 $\times$ 256
ResBlock	AvgPool	IN	32 $\times$ 32 $\times$ 512
ResBlock	AvgPool	IN	16 $\times$ 16 $\times$ 512
ResBlock	AvgPool	IN	8 $\times$ 8 $\times$ 512
ResBlock	AvgPool	IN	4 $\times$ 4 $\times$ 512
LReLU	-	-	4 $\times$ 4 $\times$ 512
Conv 4 $\times$ 4	-	-	1 $\times$ 1 $\times$ 512
LReLU	-	-	1 $\times$ 1 $\times$ 512
Linear $\times k$	-	-	$d \times k$

Table 9: Audio style encoder, discriminator, and classifier architectures. $d$ and $k$ represent the output dimension and number of domains, respectively. We use $(d,k)=(64,K_{g}),(1,K),(1,K)$ for the style encoder, discriminator, and classifier, respectively.

Layer	Resample	Norm.	Output shape
Input	-	-	80 $\times$ 192 $\times$ 1
Conv 1 $\times$ 1	-	-	80 $\times$ 192 $\times$ 64
ResBlock	AvgPool	IN	40 $\times$ 96 $\times$ 128
ResBlock	AvgPool	IN	20 $\times$ 48 $\times$ 256
ResBlock	AvgPool	IN	10 $\times$ 24 $\times$ 512
ResBlock	AvgPool	IN	5 $\times$ 12 $\times$ 512
LReLU	-	-	5 $\times$ 12 $\times$ 512
Conv 5 $\times$ 5	AvgPool	-	1 $\times$ 1 $\times$ 512
LReLU	-	-	1 $\times$ 1 $\times$ 512
Linear $\times k$	-	-	$d \times k$

Equations

$$\mathcal{L}_{adv}=\mathbb{E}_{\mathbf{x},g}[\log D^{im}_{g}(\mathbf{x})]+\mathbb{E}_{\mathbf{x},\tilde{g},\mathbf{s}}[\log(1-D^{im}_{\tilde{g}}(G^{im}(\mathbf{x},\mathbf{s})))]+\mathbb{E}_{\mathbf{a},g,y}[\log D^{au}_{g,y}(\mathbf{a})]+\mathbb{E}_{\mathbf{a},\tilde{g},\tilde{y},\mathbf{s}}[\log(1-D^{au}_{\tilde{g},\tilde{y}}(G^{au}(\mathbf{a},\mathbf{s})))],$$

$$\mathcal{L}_{sty}=\mathbb{E}_{\mathbf{x},\tilde{g},\mathbf{s}}[\|\mathbf{s}-E^{im}_{\tilde{g}}(G^{im}(\mathbf{x},\mathbf{s}))\|_{1}]+\mathbb{E}_{\mathbf{a},\tilde{g},\mathbf{s}}[\|\mathbf{s}-E^{au}_{\tilde{g}}(G^{au}(\mathbf{a},\mathbf{s}))\|_{1}]$$

$$\mathcal{L}_{ds}=\mathbb{E}_{\mathbf{x},\tilde{g},\mathbf{s}_{1},\mathbf{s}_{2}}[\|E^{im}_{\tilde{g}}(G^{im}(\mathbf{x},\mathbf{s}_{1}))-E^{im}_{\tilde{g}}(G^{im}(\mathbf{x},\mathbf{s}_{2}))\|_{1}]+\mathbb{E}_{\mathbf{a},\tilde{g},\mathbf{s}_{1},\mathbf{s}_{2}}[\|E^{au}_{\tilde{g}}(G^{au}(\mathbf{a},\mathbf{s}_{1}))-E^{au}_{\tilde{g}}(G^{au}(\mathbf{a},\mathbf{s}_{2}))\|_{1}],$$

$$\mathcal{L}_{cyc}=\mathbb{E}_{\mathbf{x},g,\tilde{g},\mathbf{s}}[\|\mathbf{x}-G^{im}_{\tilde{g}}(G^{im}(\mathbf{x},\tilde{\mathbf{s}}),\hat{\mathbf{s}}^{im})\|_{1}]+\mathbb{E}_{\mathbf{a},g,\tilde{g},\mathbf{s}}[\|\mathbf{a}-G^{au}_{\tilde{g}}(G^{au}(\mathbf{a},\tilde{\mathbf{s}}),\hat{\mathbf{s}}^{au})\|_{1}],$$

$$l_{i}=-\log\frac{\exp(\langle\mathbf{s}^{au}_{i},\mathbf{s}^{im}_{i}\rangle/\tau)}{\sum_{j=1}^{N}\exp(\langle\mathbf{s}^{au}_{i},\mathbf{s}^{im}_{j}\rangle/\tau)},$$

$$\min_{G,E,F}\ \mathcal{L}_{adv}+\lambda_{sty}\mathcal{L}_{sty}-\lambda_{ds}\mathcal{L}_{ds}+\lambda_{cyc}\mathcal{L}_{cyc}+\lambda_{asr}\mathcal{L}_{asr}+\lambda_{F0}\mathcal{L}_{F0}+\lambda_{norm}\mathcal{L}_{norm}+\lambda_{adcl}\mathcal{L}_{adcl}+\lambda_{csc}\mathcal{L}_{csc}$$

$$\min_{D,C}\ -\mathcal{L}_{adv}+\lambda_{cls}\mathcal{L}_{cls}$$

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Speech and Audio Processing

Full text

Cross-modal Face- and Voice-style Transfer

Naoya Takahashi

Mayank K. Singh

Yuki Mitsufuji

Sony Group Corporation, Japan

{Naoya.Takahashi, Mayank.A.Singh, Yuhki.Mitsufuji}@sony.com

Abstract

Image-to-image translation and voice conversion enable the generation of a new facial image and voice while maintaining some of the semantics such as a pose in an image and linguistic content in audio, respectively. They can aid in the content-creation process in many applications. However, as they are limited to the conversion within each modality, matching the impression of the generated face and voice remains an open question. We propose a cross-modal style transfer framework called XFaVoT that jointly learns four tasks: image translation and voice conversion tasks with audio or image guidance, which enables the generation of “face that matches given voice” and “voice that matches given face”, and intra-modality translation tasks with a single framework. Experimental results on multiple datasets show that XFaVoT achieves cross-modal style translation of image and voice, outperforming baselines in terms of quality, diversity, and face–voice correspondence.

1 Introduction

Image-to-image translation [25] and voice conversion [31] have been widely studied due to their wide range of applications such as character creation and editing for professional content, social media content, anonymization, and avatars. Reference-guided image-to-image translation enables the modification of specific parts in a human face image, such as hairstyle and color, on the basis of the reference image [22, 42, 52, 11], while voice conversion generates a new voice that maintains the linguistic content of a source utterance [13, 47, 46]. Although image-to-image translation and voice conversion provide easy and intuitive ways to manipulate the face and voice, they are independently studied in each modality, and their relationships are largely neglected. Therefore, considerable manual effort is required to match the impression of the generated face and voice.

To address this problem, we propose XFaVoT, a single framework for cross-modal face- and voice-style transfer that enables the generation of “a face that matches a user-provided voice” and “a voice that matches a user-provided image”, as shown in Figure 1. XFaVoT jointly learns four tasks in a single framework: audio-guided image translation, image-guided image translation, audio-guided voice conversion, and image-guided voice conversion. The proposed model learns a style embedding space that is common to the audio and image modalities and consistent between the face and voice of the same speaker. Image and audio generators learn to generate a new face and voice that reflect the style embedding obtained from either a reference face using an image encoder or a reference voice using an audio encoder. Using face–voice pairs extracted from videos of people talking, we train the proposed model on the four tasks while regularizing the consistency of the style vectors extracted from the audio and image of the same speaker via contrastive learning. To ensure both generalizability to unseen speakers and accurate modeling of speaker-dependent voice characteristics, we introduce dual-domain discriminators and a mapping network that accept two types of domain codes, gender and speaker-identity codes. To address the scarcity of high-quality clean audio-visual datasets, we further propose leveraging unpaired audio-only and image-only datasets by switching the domain code and loss functions to appropriately incorporate the unpaired data.

Experimental results on the GRID, CelebA-HQ+VCTK, and LRS3+Lip2Wav datasets show that XFaVoT can generate high-quality images and voices that reflect given references in other modalities, outperforming baselines in the audio-guided image translation and image-guided voice conversion tasks without sacrificing the performance of the intra-modal translation tasks. Samples including audio are available on our demo page (https://t-naoya.github.io/xfavot/) and in the supplemental material.

2 Related work

Image-to-image translation. Early image-to-image translation studies focus on learning mapping functions between two domains [25, 85, 48]. However, they are known to learn a deterministic mapping even with stochastic noise inputs. To improve diversity, several methods have been proposed such as marginal matching [4], latent regression [86, 23], diversity regularization [81, 52], and guidance of reference images [6, 9, 50, 60]. As these methods require separate models for each combination of two domains, the training cost becomes expensive as the number of domains increases. To address the scalability, unified frameworks have been proposed [10, 24, 49, 11]. StarGANv2 [11] learns the mappings between all available domains using a single generator. Using domain-specific branches for a discriminator and mapping network, StarGANv2 is shown to generate diverse images in the target domains. Recently, contrastive learning has been applied to enhance spatial correspondence [59, 84, 71, 27]. Several studies attempt to improve image quality by enforcing consistency in the local structure [38] and spatial perturbation [78]. Other studies focus on text-guided image manipulation [15, 57, 44, 62, 76], which aims at translating images on the basis of language. In contrast, our model translates images on the basis of voice, which provides a direct way to plausibly match the face image to the given voice.

Voice conversion. Early unsupervised voice conversion approaches, which do not require parallel data of different speakers speaking the same content, convert voices between two speakers [77, 37, 31, 67]. To improve scalability, many-to-many voice conversion models have been actively studied [8, 29, 32, 66, 46]. Li et al. [46] adapt StarGANv2 to many-to-many voice conversion and achieve state-of-the-art performance. However, the target speakers are still limited to those seen during training. Recently, one-shot voice conversion that enables any-to-any voice conversion has been actively investigated [13, 75, 7, 47]. AdaIN-VC [13] uses a speaker encoder to extract speaker embeddings and conditions the decoder using adaptive instance normalization (AdaIN) layers. Fragment-VC [47] uses a cross-attention mechanism to use fragments from reference samples to produce a converted voice. Our method can perform one-shot voice conversion in both audio- and image-guided voice conversion modes.

Cross-modal audio-visual style transfer. The most closely related work in audio-guided image translation is probably sound-guided semantic image manipulation (SGSIM) [43], which first learns an audio encoder that maps audio into the latent space of the contrastive language-image pretraining (CLIP) model and then uses the encoder to search the latent space of StyleGAN2 [34] to generate an image whose CLIP latent code is similar to that obtained from the audio. Li et al. propose learning a sound-guided stylization of landscape images from unlabeled video on a hike using noise contrastive estimation (NCE) [45]. Other studies explore stylizing images on the basis of music [41, 26] by mapping the music into StyleGAN’s latent space.

On image-guided voice conversion, we are aware of only one work called cross-modal voice conversion (CVC) [30] in which a variational autoencoder (VAE)-based voice conversion model is conditioned on a face image to specify the speaker. However, that study is limited to artificial data which randomly combine a face and voice from different datasets on the basis of gender and age attributes; hence, the model fails to learn correspondence other than the attributes. Our model is based on adversarial training and learns the face-voice correspondence on real face-voice pairs while reasonably augmenting data with synthetic face-voice pairs to improve generalization and diversity.

Face–voice correspondence. Neurocognitive studies indicate that the neurocognitive pathways for voices share a common structure with those for faces [16], and human perception implicitly recognizes the association of faces to voices [5]. Empirical studies have shown the ability of humans to associate voices of unknown individuals to pictures of their faces [28, 53]. A few studies attempt to reconstruct a face from a voice using a deterministic autoencoder [58] and a generative adversarial network [74] and show promising results. Different from these works, one of our goals is image translation with audio guidance, which maintains the domain-invariant characteristics (e.g. pose) of the source image while translating the style to match the reference voice.

3 Proposed method

3.1 Framework

XFaVoT jointly learns the four tasks, image translation and voice conversion with audio or image guidance. The proposed model is inspired by an image translation model called StarGANv2 [11] and is largely extended to operate on multiple modalities and multiple tasks with dual-domain types. We consider two types of domains, gender $g\in\mathcal{G}=\{male,female\}$ and speaker identity $y\in\mathcal{Y}$ , which are common for audio (voice) and image (face) modalities. Figure 2 shows an overview of XFaVoT, which consists of the following eight modules.

Generators The image generator $G^{im}$ translates a source image $\mathbf{x}$ into output image $G^{im}(\mathbf{x},\mathbf{s})$ by reflecting a style provided by the style vector $\mathbf{s}$ , which is provided by either the image encoder $E^{im}$ , audio encoder $E^{au}$ or mapping network $F_{g,y}$ . When the $\mathbf{s}$ is produced by the $E^{im}$ , the task is referred to as image-guided image translation, while the $E^{au}$ is used to produce the $\mathbf{s}$ for the audio-guided image translation. Similarly, the audio generator $G^{au}$ takes as input a mel-spectrogram of the source voice $\mathbf{a}$ and outputs a mel-spectrogram $G^{au}(\mathbf{a},\mathbf{s})$ reflecting the style vector. When the style vector produced by the $E^{au}$ is used, the task results in one-shot voice conversion, which we refer to as audio-guided voice conversion for consistency. When the style vector produced by the $E^{im}$ is used, we refer to the task as image-guided voice conversion.
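Tables 6 and 7 list adaptive instance normalization (AdaIN) as the normalization in the decoding half of both generators, which is where the style vector $\mathbf{s}$ enters. The following is a minimal single-channel sketch of that operation, not the authors' implementation: the function name `adain_channel`, the `eps` constant, and the premise that `scale` and `shift` have already been derived from $\mathbf{s}$ (for example by a learned linear layer) are assumptions made for illustration.

```python
import math
from typing import List


def adain_channel(features: List[float], scale: float, shift: float,
                  eps: float = 1e-5) -> List[float]:
    """Apply AdaIN to the activations of one channel of one sample.

    The channel is first instance-normalized to zero mean and unit
    variance, then re-scaled and shifted with style-dependent values.
    `scale` and `shift` are assumed to be computed from the style vector.
    """
    mean = sum(features) / len(features)
    var = sum((f - mean) ** 2 for f in features) / len(features)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [scale * (f - mean) * inv_std + shift for f in features]


# Toy usage: the same content features rendered with two hypothetical styles.
content = [1.0, 2.0, 3.0, 4.0]
styled_a = adain_channel(content, scale=2.0, shift=3.0)
styled_b = adain_channel(content, scale=0.5, shift=-1.0)
```

In this sketch the relative ordering of the activations (the content) is preserved, while their mean and spread are set by the style-dependent parameters.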

Style encoders The image style encoder $E^{im}_{g}$ and audio style encoder $E^{au}_{g}$ extract the style vector from a reference image and mel-spectrogram of a reference audio in domain $g$ , respectively. Both encoders have gender-domain-specific heads on top of a main network that is common to all domains.

Mapping network Given a latent code $\mathbf{z}\in\mathcal{Z}$ sampled from a prior distribution, the mapping network $F_{g,y}$ transforms $\mathbf{z}$ into the style vector $\mathbf{s}=F_{g,y}(\mathbf{z})$ . We use a single mapping network, which promotes the consistency of the style vectors between the audio and image modalities. The mapping network consists of a multi-layer perceptron with domain-specific heads for both $g\in\mathcal{G}$ and $y\in\mathcal{Y}$ .

Discriminators The image discriminator $D^{im}_{g}$ consists of a common network followed by domain-specific binary classification heads that distinguish whether the input is a real image of domain $g$ or a fake image translated by the $G^{im}$ . In contrast, the audio discriminator $D^{au}_{g,y}$ takes as an input the mel-spectrogram and distinguishes whether it is a real or fake one produced by the $G^{au}(\mathbf{a},\mathbf{s})$ . The $D^{au}_{g,y}$ has the same architecture as the $D^{im}_{g}$ except for the domain-specific heads, where the heads specific for $g\in\mathcal{G}$ and $y\in\mathcal{Y}$ are both available. We also introduce the audio classifier $C$ that predicts the source speaker of the converted voice $G^{au}(\mathbf{a},\mathbf{s})$ .

3.2 Training objectives

The aim of XFaVoT is to learn two mapping functions $G^{im}:\mathcal{X}_{g}\rightarrow\mathcal{X}_{\tilde{g}}$ that converts an image $\mathbf{x}\in\mathcal{X}_{g}$ from the source domain $g\in\mathcal{G}$ to a sample $\hat{\mathbf{x}}\in\mathcal{X}_{\tilde{g}}$ in the target domain $\tilde{g}\in\mathcal{G}$ and $G^{au}:\mathcal{A}_{g,y}\rightarrow\mathcal{A}_{\tilde{g},\tilde{y}}$ that converts a mel-spectrogram of source voice $\mathbf{a}\in\mathcal{A}_{g,y}$ from source domain $g\in\mathcal{G},y\in\mathcal{Y}$ to a sample $\hat{\mathbf{a}}\in\mathcal{A}_{\tilde{g},\tilde{y}}$ in the target domain $\tilde{g}\in\mathcal{G},\tilde{y}\in\mathcal{Y}$ . The speaker domain codes $y,\tilde{y}\in\mathcal{Y}$ can be unknown for some of the data, as discussed in Sec. 3.3. We jointly train the eight modules using audio and image data for audio- and image-guided image translation tasks and audio- and image-guided voice conversion tasks. We sample reference-domain codes $\tilde{g}$ , $\tilde{y}$ and a style vector $\mathbf{s}$ via either the image style encoder $E^{im}_{\tilde{g}}$ , audio style encoder $E^{au}_{\tilde{g}}$ , or mapping network $F_{\tilde{g},\tilde{y}}$ and train the model using the following loss functions.

Adversarial loss The image and audio generators take an input image $\mathbf{x}$ and mel-spectrogram $\mathbf{a}$ , respectively, along with a style vector $\mathbf{s}$ and learn to generate a new image $G^{im}(\mathbf{x},\mathbf{s})$ and mel-spectrogram $G^{au}(\mathbf{a},\mathbf{s})$ via the adversarial loss:

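$$\mathcal{L}_{adv}=\mathbb{E}_{\mathbf{x},g}[\log D^{im}_{g}(\mathbf{x})]+\mathbb{E}_{\mathbf{x},\tilde{g},\mathbf{s}}[\log(1-D^{im}_{\tilde{g}}(G^{im}(\mathbf{x},\mathbf{s})))]+\mathbb{E}_{\mathbf{a},g,y}[\log D^{au}_{g,y}(\mathbf{a})]+\mathbb{E}_{\mathbf{a},\tilde{g},\tilde{y},\mathbf{s}}[\log(1-D^{au}_{\tilde{g},\tilde{y}}(G^{au}(\mathbf{a},\mathbf{s})))],$$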

where $\tilde{g}$ and $\tilde{y}$ are the gender and speaker identity codes of the reference speaker that corresponds to the style vector. During training, we randomly choose either the gender- or speaker-specific heads to compute $D_{g,y}^{au}$ .

Style reconstruction loss The style reconstruction loss is used to ensure that the style code can be reconstructed from the generated samples.

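$$\mathcal{L}_{sty}=\mathbb{E}_{\mathbf{x},\tilde{g},\mathbf{s}}[\|\mathbf{s}-E^{im}_{\tilde{g}}(G^{im}(\mathbf{x},\mathbf{s}))\|_{1}]+\mathbb{E}_{\mathbf{a},\tilde{g},\mathbf{s}}[\|\mathbf{s}-E^{au}_{\tilde{g}}(G^{au}(\mathbf{a},\mathbf{s}))\|_{1}].$$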

Style diversification loss To enable the generators to produce diverse images and audio, we employ the diversity sensitive loss [51, 81],

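$$\mathcal{L}_{ds}=\mathbb{E}_{\mathbf{x},\tilde{g},\mathbf{s}_{1},\mathbf{s}_{2}}[\|E^{im}_{\tilde{g}}(G^{im}(\mathbf{x},\mathbf{s}_{1}))-E^{im}_{\tilde{g}}(G^{im}(\mathbf{x},\mathbf{s}_{2}))\|_{1}]+\mathbb{E}_{\mathbf{a},\tilde{g},\mathbf{s}_{1},\mathbf{s}_{2}}[\|E^{au}_{\tilde{g}}(G^{au}(\mathbf{a},\mathbf{s}_{1}))-E^{au}_{\tilde{g}}(G^{au}(\mathbf{a},\mathbf{s}_{2}))\|_{1}],$$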

where $\mathbf{s_{1}},\mathbf{s_{2}}\in\mathcal{S}_{\tilde{g},\tilde{y}}$ are two randomly sampled style vectors from domain $\tilde{g},\tilde{y}$ .

Cycle consistency loss To preserve the domain-invariant characteristics (e.g. pose in image and linguistic content in voice), we employ the cycle consistency loss [10]

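$$\mathcal{L}_{cyc}=\mathbb{E}_{\mathbf{x},g,\tilde{g},\mathbf{s}}[\|\mathbf{x}-G^{im}_{\tilde{g}}(G^{im}(\mathbf{x},\tilde{\mathbf{s}}),\hat{\mathbf{s}}^{im})\|_{1}]+\mathbb{E}_{\mathbf{a},g,\tilde{g},\mathbf{s}}[\|\mathbf{a}-G^{au}_{\tilde{g}}(G^{au}(\mathbf{a},\tilde{\mathbf{s}}),\hat{\mathbf{s}}^{au})\|_{1}],$$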

where $\hat{\mathbf{s}}^{im}=E^{im}_{g}(\mathbf{x})$ and $\hat{\mathbf{s}}^{au}=E^{au}_{g}(\mathbf{a})$ are the style vectors of the source image $\mathbf{x}$ and audio $\mathbf{a}$ , respectively.

Audio auxiliary losses Voice conversion requires maintaining the linguistic content of the source utterance while converting the voice character. To enable this, we further introduce auxiliary losses, as suggested in [46]. We use the speech consistency loss using a pretrained automatic speech recognition model $A$ as $\mathcal{L}_{asr}=\mathbb{E}_{\mathbf{a},\mathbf{s}}[||A(\mathbf{a})-A(G^{au}(\mathbf{a},\mathbf{s}))||_{1}]$ . To facilitate learning and promote consistency in pitch and rhythm, we further introduce the pitch consistency loss $\mathcal{L}_{F0}=\mathbb{E}_{\mathbf{a},\mathbf{s}}[||\bar{\mathcal{F}}(\mathbf{a})-\bar{\mathcal{F}}(G^{au}(\mathbf{a},\mathbf{s}))||_{1}]$ and the norm consistency loss $\mathcal{L}_{norm}=\mathbb{E}_{\mathbf{a},\mathbf{s}}[||N(\mathbf{a})-N(G^{au}(\mathbf{a},\mathbf{s}))||_{1}]$ , where $\bar{\mathcal{F}}(\mathbf{a})=\mathcal{F}(\mathbf{a})/||\mathcal{F}(\mathbf{a})||_{1}$ denotes the normalized fundamental frequency (F0) obtained using a pretrained F0 estimation network $\mathcal{F}$ , and $N(\cdot)$ is the frame-wise energy. To further facilitate the conversion of speaker-specific information, we use the adversarial classification loss. The audio source classifier $C$ is trained to identify the source speaker via the classification loss $\mathcal{L}_{cls}=\mathbb{E}_{\mathbf{a},\mathbf{s}}[\operatorname{CE}(C(G^{au}(\mathbf{a},\mathbf{s})),y)]$ , and the audio generator is trained to fool the classifier via the adversarial classification loss $\mathcal{L}_{adcl}=\mathbb{E}_{\mathbf{a},\mathbf{s}}[\operatorname{CE}(C(G^{au}(\mathbf{a},\mathbf{s})),\tilde{y})]$ , where $\operatorname{CE}$ denotes the cross-entropy loss, $y$ the source speaker, and $\tilde{y}$ the reference speaker.
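One consequence of dividing the F0 contour by its L1 norm is that the pitch consistency term ignores a global scaling of pitch and only penalizes changes in the shape of the contour. The sketch below illustrates this for a single utterance. It is an illustration under assumptions, not the authors' code: the helper names `normalized_f0` and `f0_consistency`, and the toy contours in Hz, are hypothetical, and a real system would obtain the contours from the pretrained F0 network.

```python
from typing import List, Sequence


def normalized_f0(f0: Sequence[float]) -> List[float]:
    """Divide an F0 contour by its L1 norm, as in the definition of F-bar."""
    l1 = sum(abs(v) for v in f0)
    if l1 == 0.0:
        raise ValueError("F0 contour has zero L1 norm")
    return [v / l1 for v in f0]


def f0_consistency(source_f0: Sequence[float],
                   converted_f0: Sequence[float]) -> float:
    """L1 distance between the normalized contours of source and output."""
    src = normalized_f0(source_f0)
    out = normalized_f0(converted_f0)
    return sum(abs(a - b) for a, b in zip(src, out))


# Hypothetical contours (Hz) over three frames.
source = [100.0, 120.0, 110.0]
same_shape_higher_voice = [200.0, 240.0, 220.0]  # source scaled by 2
different_shape = [110.0, 120.0, 100.0]          # intonation changed

loss_scaled = f0_consistency(source, same_shape_higher_voice)
loss_reshaped = f0_consistency(source, different_shape)
```

Here `loss_scaled` is zero up to floating-point error, whereas `loss_reshaped` is positive, so a converted voice may sit in a different pitch range from the source as long as the intonation pattern is kept.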

Cross-modality style consistency Using the common mapping network for image and audio modalities promotes the consistency of style vectors obtained from the image and audio encoders. However, it may not be sufficient to achieve speaker-identity-level consistency as the image encoder and discriminator use only the gender domain code $g$ . The style vectors obtained from the encoders with the face image and voice from the same speaker may not be close to each other. To further ensure consistency across the modalities, we use the infoNCE loss [3]. Assuming audio-image pair data, we sample pairs $(\mathbf{a}_{i},\mathbf{x}_{i}),i=1,\dots,N$ from N speakers and compute style vectors $(\mathbf{s}^{au}_{i},\mathbf{s}^{im}_{i})=(E^{au}_{g}(\mathbf{a}_{i}),E^{im}_{g}(\mathbf{x}_{i}))$ . We then compute the following loss function for the $i$ th pair

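$$l_{i}=-\log\frac{\exp(\langle\mathbf{s}^{au}_{i},\mathbf{s}^{im}_{i}\rangle/\tau)}{\sum_{j=1}^{N}\exp(\langle\mathbf{s}^{au}_{i},\mathbf{s}^{im}_{j}\rangle/\tau)},$$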

where $\langle\cdot,\cdot\rangle$ denotes the cosine similarity, and $\tau$ is a temperature parameter. The cross-modality style consistency loss is obtained by averaging the per-pair losses, $\mathcal{L}_{csc}=\frac{1}{N}\sum_{i=1}^{N}l_{i}$ .
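The cross-modality style consistency loss can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: the function names `cosine` and `csc_loss`, the toy two-dimensional style vectors, and the temperature value 0.1 are assumptions chosen for the example.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two style vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def csc_loss(s_au: List[Sequence[float]], s_im: List[Sequence[float]],
             tau: float = 0.1) -> float:
    """Average InfoNCE loss over N audio-image style pairs.

    s_au[i] and s_im[i] are assumed to come from the same speaker;
    every s_im[j] with j != i acts as a negative for s_au[i].
    """
    n = len(s_au)
    total = 0.0
    for i in range(n):
        logits = [cosine(s_au[i], s_im[j]) / tau for j in range(n)]
        # log-sum-exp with max subtraction for numerical stability
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -(logits[i] - log_denominator)
    return total / n


# Toy example with hypothetical 2-D style vectors for N = 2 speakers.
audio_styles = [[1.0, 0.0], [0.0, 1.0]]
aligned_images = [[0.9, 0.1], [0.1, 0.9]]  # same speaker order as the audio
swapped_images = [[0.1, 0.9], [0.9, 0.1]]  # speakers mismatched

loss_aligned = csc_loss(audio_styles, aligned_images)
loss_swapped = csc_loss(audio_styles, swapped_images)
```

In this toy setting `loss_aligned` is close to zero and `loss_swapped` is much larger, which is the behavior the loss is meant to encourage: style vectors of the same speaker are pulled together across modalities, and those of different speakers are pushed apart.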

Full objectives Our full objective function for generators can be summarized as follows:

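$$\min_{G,E,F}\ \mathcal{L}_{adv}+\lambda_{sty}\mathcal{L}_{sty}-\lambda_{ds}\mathcal{L}_{ds}+\lambda_{cyc}\mathcal{L}_{cyc}+\lambda_{asr}\mathcal{L}_{asr}+\lambda_{F0}\mathcal{L}_{F0}+\lambda_{norm}\mathcal{L}_{norm}+\lambda_{adcl}\mathcal{L}_{adcl}+\lambda_{csc}\mathcal{L}_{csc},$$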

where the generator $G$ and style encoder $E$ include both audio and image modules, and $\lambda_{sty}$ , $\lambda_{ds}$ , $\lambda_{cyc}$ , $\lambda_{asr}$ , $\lambda_{F0}$ , $\lambda_{norm}$ , $\lambda_{adcl}$ and $\lambda_{csc}$ are the hyperparameters for each term.
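The way the individual terms combine in the objectives of this section can be sketched as a weighted sum. The snippet is illustrative only: the dictionary keys, the `paired` flag (which drops the cross-modality term when no audio-visual pair is available, as described in Sec. 3.3), and the all-ones loss values and weights are hypothetical and do not reflect the hyperparameters used in the paper.

```python
from typing import Dict

# Terms that are added to the adversarial loss in the generator objective.
ADDED_TERMS = ("sty", "cyc", "asr", "f0", "norm", "adcl", "csc")


def generator_objective(losses: Dict[str, float], weights: Dict[str, float],
                        paired: bool = True) -> float:
    """Weighted sum minimized by the generators, encoders, and mapping network."""
    total = losses["adv"]
    for name in ADDED_TERMS:
        if name == "csc" and not paired:
            continue  # no audio-visual pair, so the consistency term is omitted
        total += weights[name] * losses[name]
    # The diversification term is subtracted because it is to be maximized.
    total -= weights["ds"] * losses["ds"]
    return total


def discriminator_objective(losses: Dict[str, float],
                            weights: Dict[str, float]) -> float:
    """Weighted sum minimized by the discriminators and the classifier."""
    return -losses["adv"] + weights["cls"] * losses["cls"]


# Hypothetical values, chosen only to make the arithmetic easy to follow.
names = ("adv", "sty", "ds", "cyc", "asr", "f0", "norm", "adcl", "csc", "cls")
example_losses = {name: 1.0 for name in names}
example_weights = {name: 1.0 for name in names}

g_paired = generator_objective(example_losses, example_weights, paired=True)
g_unpaired = generator_objective(example_losses, example_weights, paired=False)
d_value = discriminator_objective(example_losses, example_weights)
```

Writing the diversification term with a negative sign keeps the whole objective a single minimization problem, which is why $\lambda_{ds}$ is the only weight preceded by a minus sign.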

Our full objective for discriminators is given as:

$$\min_{D,C}\ -\mathcal{L}_{adv}+\lambda_{cls}\mathcal{L}_{cls}$$

3.3 Leveraging dataset in single-modality

To achieve high-fidelity generation, datasets that have high-quality image and audio are required. There are a few audio-visual datasets [14, 17, 63] that provide videos of people talking with clearly visible faces and clean audio without noise and reverberation. However, as such clean data require a controlled recording environment or a careful curation process, the dataset size is limited, which hinders the learning of diverse audio and image generation. Although large-scale video datasets collected from the Internet, such as LRS3 [65], offer diverse videos of people speaking, faces are often blurry and low-resolution, and the audio contains noise and reverberation. Training the model on such a distorted, unclean dataset hinders the learning of high-quality audio and image generation. To address this problem, we propose using high-quality datasets independently available in audio and image domains along with clean (possibly small) audio-visual datasets. Specifically, we use CelebA-HQ [33] for image and VCTK [79] for audio. As only the cross-modality style consistency loss $\mathcal{L}_{csc}$ requires audio-visual pair data, we omit $\mathcal{L}_{csc}$ from the full objective when we sample the style vector using an image from the image-only dataset $x^{im}$ or audio from the audio-only dataset $a^{au}$ . Since the image-only dataset does not have real audio that corresponds to the image $x^{im}$ , the speaker-identity-specific heads of the $D^{au}$ cannot be trained with the style vector produced using $x^{im}$ . In this case, we only train the gender-specific heads. (Hence, we do not require the speaker label $y$ for image-only data.) The combination of the style vector source and the use of the loss functions is summarized in Table 1.
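The switching rule summarized in Table 1 can be written as a small lookup. This is a sketch of the rule as described in this section, not the authors' training code; the class name `TrainingSwitch`, the function `select_switch`, and the string keys naming the style-vector source are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TrainingSwitch:
    use_csc_loss: bool                  # include the cross-modality loss?
    audio_disc_heads: Tuple[str, ...]   # trainable heads of the audio discriminator


def select_switch(style_source: str) -> TrainingSwitch:
    """Return the loss and head configuration for a style-vector source.

    "g" denotes the gender-specific heads and "y" the speaker-specific heads.
    """
    table = {
        # audio or image taken from the paired audio-visual dataset
        "audio_from_av": TrainingSwitch(True, ("g", "y")),
        "image_from_av": TrainingSwitch(True, ("g", "y")),
        # audio-only data: speaker labels exist, but there is no paired image
        "audio_only": TrainingSwitch(False, ("g", "y")),
        # image-only data: no matching voice, so only the gender heads are trained
        "image_only": TrainingSwitch(False, ("g",)),
    }
    return table[style_source]


image_only_switch = select_switch("image_only")
paired_switch = select_switch("audio_from_av")
```

Keeping the rule in one table makes it explicit that only the image-only case restricts the audio discriminator to its gender heads.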

3.4 Implementation details

We base our model implementation on the official code of StarGANv2 [11] (https://github.com/clovaai/stargan-v2) and use the same network architecture for the encoders, discriminators, and mapping network. The audio classifier has the same architecture as the discriminator. For audio representation, we use an 80-band mel-spectrogram with a fast Fourier transform size of 2048 and a hop size of 300. The generated mel-spectrograms are converted to waveforms using the Parallel WaveGAN vocoder [80]. We provide further details in the Appendix.
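As a quick check of what these settings imply for timing, the arithmetic below combines the hop size with the 24 kHz sampling rate stated in Sec. 4 and the 80 $\times$ 192 input shape listed in Table 7. Reading the 192 axis as the number of time frames is an assumption of this sketch, and the constant names are hypothetical.

```python
SAMPLE_RATE_HZ = 24_000  # audio is resampled to 24 kHz (Sec. 4)
HOP_SIZE = 300           # hop size in samples (Sec. 3.4)
N_MELS = 80              # number of mel bands (Sec. 3.4)
N_FRAMES = 192           # assumed: the 192 axis of the 80 x 192 input is time

# Time advanced by one spectrogram frame, in milliseconds.
frame_shift_ms = 1000 * HOP_SIZE / SAMPLE_RATE_HZ

# Duration covered by one 192-frame input segment, in seconds.
segment_seconds = N_FRAMES * HOP_SIZE / SAMPLE_RATE_HZ

# Number of spectrogram frames produced per second of audio.
frames_per_second = SAMPLE_RATE_HZ / HOP_SIZE

print(f"frame shift: {frame_shift_ms} ms")
print(f"segment length: {segment_seconds} s")
print(f"frames per second: {frames_per_second}")
```

Under these assumptions each frame advances by 12.5 ms and one 192-frame segment covers 2.4 s of audio.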

4 Experiments

Our goal is to perform the cross-modal style transfer tasks (audio-guided image translation and image-guided voice conversion) while maintaining the performance of the base model [11] in the image-guided image translation task and extending StarGANv2VC [46] to the one-shot (audio-guided) voice conversion task. Thus, we evaluate the proposed method on the four tasks.

Datasets. For audio-visual data, we use three datasets: GRID [14], Lip2Wav [63], and LRS3 [65]. As the videos in LRS3 are TED-talk recordings online, their audio is mostly noisy. We manually choose 16 videos that contain relatively less noise and reverberation. The numbers of speakers in GRID, Lip2Wav, and LRS3 are 33, 4, and 16, respectively. We exclude four speakers from GRID for evaluating the models on unseen speakers. Face images are extracted from video frames and aligned as done in [33]. We use CelebA-HQ [33] as the image-only dataset. For audio-only data, VCTK [79], in which utterances from 109 speakers are available, is used. We exclude 30 speakers for evaluation. We resize images to 128 $\times$ 128 resolution and resample audio to 24 kHz. The length of audio varies from 6 to 9 s depending on the utterance. The datasets are randomly split into 90% and 10% for training and validation sets, respectively, except CelebA-HQ, from which we extract 1000 male and female images for the validation set, as done in [11].

Baselines. To the best of our knowledge, there have been no studies investigating the four translation tasks in the audio and image domains interchangeably. Therefore, we consider baselines for each task. For the image-guided image translation task, we use StarGANv2 [11] since our model shares the same architecture in this task. For the audio-guided image translation task, we use SGSIM [43] by following the official implementation [2]. For the audio-guided voice conversion task, we use two state-of-the-art one-shot voice conversion approaches, namely, AdaIN-VC [13] and Fragment-VC [47]. Finally, we consider CVC [30] for the image-guided voice conversion task. However, when we trained the original CVC model on our dataset, we obtained unsatisfactory results, possibly because our datasets are more complex and diverse than the artificial dataset used in the original study [30]. After exploration, we found that replacing the network architecture with that used in AdaIN-VC [13] improved voice quality; thus, we use this model as a baseline and refer to it as AdaIN-CVC. More details can be found in the supplementary material.

Evaluation metrics. We evaluate the visual quality and diversity of generated images using Fréchet inception distance (FID) [20] and learned perceptual image patch similarity (LPIPS) [83]. Following the evaluation protocol in [11], we compute FID and LPIPS for the pairs of gender domains (female $\leftrightarrows$ male) within a dataset and report their average values. We translate each test image from a source domain into a target domain using ten reference images or voices randomly sampled from the test set of a target domain. (The details on evaluation metrics and protocols are further described in the supplementary material.) To evaluate generated voices, we use three metrics: NISQA [56], speaker similarity (SpkSim), and word error rate (WER). NISQA is a neural network model that predicts the overall mean opinion score of the naturalness of generated speech. SpkSim is the cosine similarity of d-vectors extracted from a speaker verification model [1], which is commonly used to evaluate the speaker similarity of samples generated using voice conversion systems. The WER is obtained using the end-to-end speech recognition system [35] provided in ESPNet [73].

4.1 Image translation

We first evaluate the proposed model on the image- and audio-guided image translation tasks. Table 2 summarizes the results on CelebA-HQ, GRID, and the combination of LRS3 and Lip2Wav. In the image-guided task, XFaVoT achieves results competitive with StarGANv2. Note that we do not aim at improving the image-guided image translation performance of StarGANv2, but rather at achieving audio-guided image translation with performance competitive with the image-guided task. Among the different datasets, both models obtain the lowest FID score on CelebA-HQ and the highest score on LRS3+Lip2Wav. This tendency is consistent with the image quality of the datasets; CelebA-HQ provides the highest quality while LRS3+Lip2Wav provides the lowest.

In the audio-guided image translation task, XFaVoT significantly outperforms SGSIM and achieves results competitive with the image-guided case. This suggests that XFaVoT successfully learns intra- and cross-modal image style translation without compromising the state-of-the-art image generation quality of StarGANv2. Figure 3 shows that XFaVoT can generate more natural images by reflecting the style of the voice, including gender and age. The styles given by the voice are consistent across different source images, as shown in Figure 4. Samples with audio and an ablation study can be found in the supplemental material.

4.2 Voice conversion

Next, we evaluate the proposed method on the two voice conversion tasks. For the audio-guided voice conversion task, we generate 140 voices using randomly sampled source and reference audio pairs from GRID and VCTK. We omit the evaluation on LRS3 and Lip2Wav due to the lack of transcriptions required to calculate the WER; however, perceptual quality is very similar to the results on the other datasets. The results are shown in Table 3. XFaVoT outperforms the one-shot voice conversion baselines for all metrics on both datasets, indicating the effectiveness of the StarGANv2-based framework extended for one-shot voice conversion.

For the image-guided voice conversion task, we generate 180 voices using randomly sampled source audio and reference images from GRID. We also evaluate the models on VCTK, which contains only audio data, by using randomly sampled images from CelebA-HQ as reference. XFaVoT outperforms the AdaIN-CVC baseline on both datasets. We encourage readers to refer to our anonymous demo page for audio samples.

4.3 Face–voice correspondence

In this section, we evaluate how well the generated voices and face images correspond to the reference on cross-modal translation tasks, namely, audio-guided image translation and image-guided voice conversion. As we aim to evaluate not only high-level correspondence such as gender but also subjective impressions (e.g. age, physical constitution, energetic/cool, etc.) without access to these labels, we conducted a subjective evaluation. We create triplets ( $x^{A},x^{B},a$ ) consisting of two face images $x^{A},x^{B}$ and a voice $a$ and ask evaluators which of $x^{A}$ and $x^{B}$ corresponds to the voice. For audio-guided image translation, we generate two images $\hat{x}_{p}$ and $\hat{x}_{n}$ for each source image $x^{i}$ using two reference voices $a_{p},a_{n}$ of different speakers and create the triplets ( $\hat{x}_{p},\hat{x}_{n},a_{p}$ ). For image-guided voice conversion, we generate two converted voices $\hat{a}_{p},\hat{a}_{n}$ for each source voice using two images $x_{p},x_{n}$ of different speakers and create the triplets ( $x_{p},x_{n},\hat{a}_{p}$ ). The two images of each triplet are randomly shuffled and shown to evaluators. Hence, the ratio of evaluators choosing the image $\hat{x}_{p}$ or $x_{p}$ , which we call the correspondence preference ratio (CPR), indicates how well the image and voice correspond to each other. As an upper baseline, we also evaluate ground truth samples by creating the triplets ( $x_{p},x_{n},a_{p}$ ) using an image $x_{p}$ and voice $a_{p}$ from the same speaker as positive samples and a randomly sampled image $x_{n}$ from a different speaker as the negative sample. We create 30 triplets for each model and each task, and 23 evaluators participate in the test.
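As a minimal sketch of how the CPR could be computed from collected judgements, the function below takes one boolean per judgement, set to true when the evaluator picked the positive image ( $\hat{x}_{p}$ or $x_{p}$ ). The function name and input format are assumptions for illustration; the text does not specify how responses were stored or aggregated.

```python
def correspondence_preference_ratio(picked_positive):
    """Fraction of judgements in which the positive image was chosen.

    `picked_positive` is an iterable of booleans, one per (evaluator, triplet)
    judgement. A value of 0.5 corresponds to chance for a two-way choice.
    """
    votes = list(picked_positive)
    if not votes:
        raise ValueError("at least one judgement is required")
    return sum(1 for v in votes if v) / len(votes)


# Toy usage: three of four judgements favour the positive image.
cpr = correspondence_preference_ratio([True, True, False, True])
```

If every evaluator judged every triplet (an assumption, not stated in the text), each model and task would contribute 30 × 23 = 690 judgements.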

Figure 5 shows the CPRs on GRID and CelebA-HQ+VCTK. XFaVoT outperforms the baselines on both the audio-guided image translation and image-guided voice conversion tasks. The CPR of the proposed method is close to that of the ground truth (GT) on the GRID dataset, suggesting that the face-voice correspondence of generated images and voices is sufficiently high. (There is no ground truth for the CelebA-HQ+VCTK dataset, as CelebA-HQ and VCTK are independent datasets in the image and audio modalities, respectively.) The results on the CelebA-HQ+VCTK dataset highlight how well the model generalizes to the combination of datasets in different modalities, since the cross-modality style consistency loss $\mathcal{L}_{csc}$ is not applicable to this combination during training. Our method achieves a high CPR even in this setting.

4.4 Latent-guided face and voice generation

As XFaVoT learns the mapping network that maps a latent code to the style vector, we can sample a face and voice style from the latent distribution and perform latent-guided face translation and voice conversion. As illustrated in Figure 6, the proposed model generates diverse faces and voices from the source face and voice without any reference.

4.5 Application

One application of our model is avatar image generation based on voice. For example, users can choose an avatar image from a few preset characters and personalize it using their voice as a reference in the audio-guided image translation mode of the proposed model trained on avatar images. In Figure 7a, we show the results of the proposed model trained on images stylized using JoJoGAN [12]. Another application is to convert a user's voice to match a user-selected avatar image, as illustrated in Figure 7b.

5 Discussion and Conclusion

We proposed a unified image translation and voice conversion framework with audio and image guidance that converts a face image and voice to plausibly match a user-provided reference in the other modality. Experimental results indicate that our model outperforms baselines in the audio-guided image translation and image-guided voice conversion tasks in terms of quality, image diversity, and face–voice correspondence. One limitation of the proposed method is bias in the training data. For example, the datasets used in this study do not contain many faces and voices of children. Hence, our model still produces faces or voices of adults when we provide the face or voice of a child as a reference. Increasing the diversity of the dataset is left for future work.

Appendix A Training details

We train our model for 20,000 steps with a batch size of 32. The training time is about 2.5 days on a single NVIDIA RTX A6000 GPU with our implementation in PyTorch [61]. We set $\lambda_{sty}=1$ , $\lambda_{ds}=2$ , $\lambda_{cyc}=1$ , $\lambda_{asr}=20$ , $\lambda_{F0}=5$ , $\lambda_{norm}=1$ , $\lambda_{advcls}=0.05$ , $\lambda_{csc}=1$ , $\lambda_{cls}=0.05$ , and $\tau=1$ . We use the Adam [36] optimizer with $\beta_{1}=0$ and $\beta_{2}=0.99$ for the image generator $G^{im}$ , image style encoder $E^{im}$ , image discriminator $D^{im}$ , and mapping network $F$ , as done in [11], while we use AdamW for the remaining (audio) modules, as done in [46]. The learning rate for $F$ is set to $10^{-6}$ , while those of the other modules are set to $10^{-4}$ . We initialize the weights of all modules using He initialization [18] and set the biases of $G^{im},E^{im}$ , and $D^{im}$ to zero, except for the biases associated with the scaling vectors of AdaIN, which are set to one. We employ exponential moving averages over parameters [82, 33] for $G^{au},G^{im},E^{au},E^{im}$ , and $F$ . During training, we randomly crop 2.47 s of audio, which corresponds to 192 frames in the mel-spectrogram, from randomly sampled audio clips. For inference, we use the entire audio clip (from 6 to 9 s), as our audio generator and style encoders are fully convolutional networks and can accept audio of arbitrary length.
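The 2.47 s crop length can be related to the 192 mel-spectrogram frames with a short calculation. The sketch below assumes an analysis window equal to the FFT size (2048 samples) and framing without padding; the text does not state the framing convention, so this is one reading that reproduces the reported figure rather than a description of the authors' code.

```python
SAMPLE_RATE = 24_000  # Hz, from the dataset description
HOP_SIZE = 300        # samples between consecutive frames
FFT_SIZE = 2048       # assumed to equal the analysis window length
NUM_FRAMES = 192      # frames per training crop


def samples_spanned(num_frames, hop=HOP_SIZE, window=FFT_SIZE):
    """Samples covered by `num_frames` windows when no padding is applied (assumption)."""
    return (num_frames - 1) * hop + window


samples = samples_spanned(NUM_FRAMES)   # 191 hops plus one full window
seconds = samples / SAMPLE_RATE         # crop duration in seconds
frames_per_second = SAMPLE_RATE / HOP_SIZE
```

Under these assumptions the crop spans 59,348 samples, which rounds to 2.47 s, and the mel-spectrogram runs at 80 frames per second.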

Appendix B Evaluation protocol

This section provides details for the evaluation metrics and evaluation protocols used in all experiments. We follow the evaluation protocol used in StarGANv2 [11] for the evaluation of the image quality and diversity.

Fréchet inception distance (FID) [20] measures the discrepancy between two groups of images. We use the outputs of the final average pooling layer of Inception-V3 [64] pretrained on the ImageNet dataset. We translate each test image from a source domain into a target domain using 10 reference samples (i.e. 10 images for the image-guided task and 10 utterances for the audio-guided task) randomly sampled from the test set of a target domain and reference modality. Then, we calculate FID between the translated images and training images in the target domain. We calculate the FID values for every pair of gender domains (i.e. female $\rightarrow$ male and male $\rightarrow$ female) and report the average value.
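For reference, the standard definition of FID from [20] fits a Gaussian to each group of Inception features and compares the two. The symbols below are our notation, not the paper's: $\mu_{g},\Sigma_{g}$ denote the mean and covariance of the features of the translated images, and $\mu_{r},\Sigma_{r}$ those of the training images in the target domain.

$$\mathrm{FID}=\lVert\mu_{g}-\mu_{r}\rVert_{2}^{2}+\mathrm{Tr}\left(\Sigma_{g}+\Sigma_{r}-2\left(\Sigma_{g}\Sigma_{r}\right)^{1/2}\right)$$

Lower values indicate that the translated images are statistically closer to the real images of the target domain.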

Learned perceptual image patch similarity (LPIPS) [83] measures the diversity of generated images using the L1 distance between features extracted using AlexNet [39] pretrained on the ImageNet dataset. For each test image from a source domain, we generate 10 outputs of a target domain using 10 randomly sampled reference images or utterances. We then calculate the average of the pairwise distances among all outputs generated from the same source image (i.e. 45 pairs). We report the average of the LPIPS values over all test images.
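The pairwise averaging in this protocol can be sketched as follows. The distance function here is a toy mean absolute difference over plain lists, standing in for the AlexNet-feature distance; `mean_pairwise_distance` and `l1_distance` are hypothetical names used only for illustration.

```python
from itertools import combinations


def l1_distance(a, b):
    """Toy stand-in for the perceptual distance: mean absolute difference."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def mean_pairwise_distance(outputs, distance=l1_distance):
    """Average `distance` over all unordered pairs of outputs from one source image."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("at least two outputs are required")
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


# Ten outputs per source image give C(10, 2) unordered pairs.
num_pairs = len(list(combinations(range(10), 2)))
```

With 10 outputs per source image, `num_pairs` matches the 45 pairs mentioned above; the reported score would then be the mean of this quantity over all test images.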

NISQA [55, 56] is a neural network-based speech quality prediction model that estimates the mean opinion score of human evaluations on the naturalness of synthesized speech. The highest score is five and the lowest score is one. For audio-guided voice conversion, we generate 140 voices using randomly sampled source and reference audio pairs and compute the NISQA scores using the model provided on the official website (https://github.com/gabrielmittag/NISQA). For image-guided voice conversion, we generate 180 voices using randomly sampled source audio and reference images. Finally, we compute the average of the NISQA scores.

Speaker similarity (SpkSim) is the cosine similarity of d-vectors [70] extracted from a speaker verification model (https://github.com/resemble-ai/Resemblyzer). For audio-guided voice conversion, we compute the SpkSim scores for each converted voice by comparing the converted voice with the reference voice used for the conversion and report the average value. For the evaluation of image-guided voice conversion on GRID, we compare each converted voice with 10 random utterances of the target speaker used for the reference image, and average the scores.
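A minimal sketch of the SpkSim computation is given below. It assumes the d-vectors have already been extracted and are available as plain lists of floats; the function names are hypothetical, and the embedding step itself (performed by the speaker verification model) is not shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def spk_sim(converted, references):
    """Average similarity of one converted-voice embedding to reference embeddings.

    Pass a single reference for the audio-guided case, or several utterances of
    the target speaker for the image-guided case.
    """
    if not references:
        raise ValueError("at least one reference embedding is required")
    return sum(cosine_similarity(converted, r) for r in references) / len(references)
```

The same helper covers both protocols: one reference embedding per converted voice for audio guidance, and 10 target-speaker utterances per converted voice for image guidance on GRID.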

Word error rate (WER) measures the intelligibility of the converted voice using automatic speech recognition (ASR). We use a joint CTC-attention based end-to-end ASR system [35] provided in the ESPNet toolkit [73].
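For readers unfamiliar with the metric, WER is the word-level edit distance between the ASR transcript and the reference text, divided by the number of reference words. The sketch below is a generic implementation under simple whitespace tokenization; it is not the ESPNet scoring code, and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word and one substituted word against a six-word reference.
wer = word_error_rate("bin blue at f two now", "bin blue f too now")
```

When scoring a whole test set, the usual convention is to sum the edit counts and reference lengths over all utterances before dividing, rather than averaging per-utterance rates.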

Appendix C Ablation study

We evaluate the individual components newly introduced in XFaVoT. For image translation tasks, we start from our base model, StarGANv2 [11], and cumulatively add each component. Table 4 shows the FID and LPIPS for several configurations. When we add audio modules with the same domain codes as the image modules (i.e. gender domain) and jointly train the model for the four tasks without the cross-modality style consistency loss $\mathcal{L}_{csc}$ , we obtain a slight improvement on FID with image guidance. However, we observe a mode collapse for image-guided image translation and obtain a low LPIPS. Interestingly, we do not observe the severe mode collapse with audio guidance and obtain a higher LPIPS than in the image-guided task. However, FID becomes high with audio guidance, indicating low image quality. By adding $\mathcal{L}_{csc}$ , we observe improvements on FID and degradation on LPIPS for both image- and audio-guided tasks. When we introduce the proposed dual domain types (gender and speaker identity) for the discriminators and mapping network, the model becomes our proposed XFaVoT, and FID and LPIPS are significantly improved for both image- and audio-guided tasks. Finally, we also evaluate XFaVoT trained without $\mathcal{L}_{csc}$ and observe that FID and LPIPS are significantly degraded. These results indicate that both $\mathcal{L}_{csc}$ and dual domain codes are necessary for achieving high quality and diversity for the cross- and intra-modal image translation tasks.

For voice conversion tasks, we start from a modified version of StarGANv2-VC [46], where we extend StarGANv2-VC to one-shot voice conversion by introducing gender-domain specific heads for the style encoder, mapping network, and discriminators. Similar to the evaluation of the image translation tasks, we then cumulatively add the joint training of image modules, $\mathcal{L}_{csc}$ , and the dual domain codes. The results are summarized in Table 5. Overall, we observe similar NISQA scores and WERs for all configurations; however, SpkSim scores are clearly improved when we introduce the dual domain codes. The results indicate that using the dual domain codes is essential for achieving high speaker similarity with the reference while enabling one-shot voice conversion.

Appendix D Additional results

We provide additional results of audio-guided image translation and image-guided voice conversion in Figure 8 and Figure 9, respectively.

Appendix E Visualization of learned style embedding space

We visualize the learned style embedding space using t-SNE [69]. In Figure 10, we plot the style vectors extracted from the faces and voices of the GRID speakers using image and audio style encoders, respectively. As observed, style vectors extracted from faces (denoted with $\circ$ ) and voices (denoted with $\star$ ) of the same speaker (denoted with colors) are placed closer to each other than to those of other speakers. These results indicate that the audio and image style encoders successfully learn a common style space for audio and image that captures speaker-specific styles.

Appendix F AdaIN-CVC baseline

We provide the details of the AdaIN-CVC baseline. The model architecture is based on AdaIN-VC [13], and the speaker encoder is replaced with the face encoder used in CVC [30]. We base our code on the official implementation of AdaIN-VC (https://github.com/jjery2243542/adaptive_voice_conversion). We train the model by following the official implementation using the same audio-visual datasets as the proposed model, namely, GRID, LRS3, and Lip2Wav. Note that the training framework requires audio-image pair data and cannot use audio-only or image-only data.

Appendix G Network architecture

We base our network architecture on StarGANv2 [11, 46].

Generators (Table 6, Table 7). The image and audio generators consist of four downsampling blocks, four intermediate blocks, and four upsampling blocks, all of which consist of pre-activation residual blocks [19]. Instance normalization (IN) [68] is used for downsampling blocks, while adaptive instance normalization (AdaIN) [21] is used for upsampling blocks. A style code is fed to all AdaIN layers to provide scaling and shifting vectors through learned affine transformations. We remove skip connections with the adaptive wing based heatmap [72] for upsampling layers of the image generator. We concatenate F0 features extracted from the source utterance using a pre-trained joint detection and classification (JDC) F0 extraction network [40] at the middle of the intermediate blocks, as done in [46].

Mapping network. The mapping network consists of an MLP with $K$ output branches, where $K$ denotes the total number of domains (2 gender $+$ 128 speaker identity = 130). The MLP consists of four shared fully connected layers, followed by four domain-specific fully connected layers for each domain. The dimensions of the latent code, the hidden layer, and the style vector are set to 16, 512, and 64, respectively. The latent code is sampled from the standard Gaussian distribution.
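To give a sense of scale, the sketch below counts the parameters implied by these dimensions. It assumes the layer layout of the official StarGANv2 mapping network (the first shared layer maps the latent code to the hidden width, the last domain-specific layer maps to the style dimension, and every layer has a bias); the text does not confirm these details, so the figures are indicative only.

```python
LATENT_DIM, HIDDEN_DIM, STYLE_DIM = 16, 512, 64
NUM_DOMAINS = 2 + 128  # gender domains + speaker-identity domains


def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


# Four shared layers: latent -> hidden, then three hidden -> hidden (assumed layout).
shared = linear_params(LATENT_DIM, HIDDEN_DIM) + 3 * linear_params(HIDDEN_DIM, HIDDEN_DIM)

# Four layers per domain: three hidden -> hidden, then hidden -> style (assumed layout).
per_domain = 3 * linear_params(HIDDEN_DIM, HIDDEN_DIM) + linear_params(HIDDEN_DIM, STYLE_DIM)

total = shared + NUM_DOMAINS * per_domain
```

Under these assumptions the domain-specific branches account for nearly all of the roughly 107.5 million parameters, which shows how the 128 speaker-identity branches dominate the size of the mapping network.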

Style encoders (Table 8, Table 9). Our style encoders consist of a CNN with $K_{g}$ output heads, where $K_{g}$ is the number of gender domains (i.e. $K_{g}=2$ ). Five and four pre-activation residual blocks are shared among all domains in image- and audio-style encoders, respectively, followed by one gender-domain specific fully connected layer. The output dimension of the domain-specific heads is 64, which is the dimension of the style vector.

Discriminators (Table 8, Table 9). Multi-task discriminators [54], which contain multiple linear output branches, are used for the image and audio discriminators. Five and four pre-activation residual blocks with leaky ReLU are used for the image and audio discriminators, respectively. The image discriminator has $K_{g}$ fully connected layers for real/fake classification of each gender domain, while the audio discriminator uses $K=K_{g}+K_{y}$ fully connected layers for real/fake classification of each gender domain and speaker identity domain, where $K_{y}=128$ indicates the number of speaker identity domains. The audio classification network has the same network architecture as the audio discriminator, where the $K=K_{g}+K_{y}$ output branches are treated as $K_{g}$ -class and $K_{y}$ -class classification heads for source gender and speaker identity classification, respectively.

Bibliography

[1] Resemblyzer. https://github.com/resemble-ai/Resemblyzer
[2] Sound-Guided Semantic Image Manipulation. https://github.com/kuai-lab/sound-guided-semantic-image-manipulation
[3] Jean-Baptiste Alayrac, Adriá Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, and Andrew Zisserman. Self-supervised multimodal versatile networks. In Proc. NeurIPS, 2020.
[4] Amjad Almahairi, Sai Rajeswar, Alessandro Sordoni, Philip Bachman, and Aaron Courville. Augmented cyclegan: Learning many-to-many mappings from unpaired data. In Proc. ICML, 2018.
[5] Pascal Belin, Shirley Fecteau, and Catherine Bedard. Thinking the voice: neural correlates of voice perception. Trends in Cognitive Sciences, 8, 2004.
[6] Huiwen Chang, Jingwan Lu, Fisher Yu, and Adam Finkelstein. Pairedcyclegan: Asymmetric style transfer for applying and removing makeup. In Proc. CVPR, 2018.
[7] Yen-Hao Chen, Da-Yi Wu, Tsung-Han Wu, and Hung yi Lee. Again-vc: A one-shot voice conversion using activation guidance and adaptive instance normalization. In Proc. ICASSP, 2021.
[8] Ju chieh Chou, Cheng chieh Yeh, Hung yi Lee, and Lin shan Lee. Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations. In Proc. Interspeech, 2018.