# Cross-modal Face- and Voice-style Transfer

**Authors:** Naoya Takahashi, Mayank K. Singh, Yuki Mitsufuji

arXiv: 2302.13838 · 2023-03-02

## TL;DR

This paper introduces XFaVoT, a novel framework for cross-modal style transfer that jointly performs face and voice translation tasks, enabling the generation of matching face-voice pairs with improved quality and diversity.

## Contribution

XFaVoT is the first unified model to perform cross-modal face and voice style transfer, matching impressions across modalities and outperforming baselines in quality, diversity, and face-voice correspondence.

## Key findings

- Outperforms baselines in the quality and diversity of generated faces and voices
- Achieves better face-voice correspondence than the baselines
- Results hold across multiple datasets

## Abstract

Image-to-image translation and voice conversion enable the generation of a new facial image and voice while maintaining some of the semantics such as a pose in an image and linguistic content in audio, respectively. They can aid in the content-creation process in many applications. However, as they are limited to the conversion within each modality, matching the impression of the generated face and voice remains an open question. We propose a cross-modal style transfer framework called XFaVoT that jointly learns four tasks: image translation and voice conversion tasks with audio or image guidance, which enables the generation of "face that matches given voice" and "voice that matches given face", and intra-modality translation tasks with a single framework. Experimental results on multiple datasets show that XFaVoT achieves cross-modal style translation of image and voice, outperforming baselines in terms of quality, diversity, and face-voice correspondence.
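
The four jointly learned tasks named in the abstract can be read as every pairing of a target modality (image or audio) with a guidance modality (image or audio). The sketch below only enumerates that task grid; it is not the authors' implementation, and the names `Task`, `MODALITIES`, and `TASKS` are hypothetical, introduced here for illustration.

```python
from dataclasses import dataclass
from itertools import product

# The two modalities mentioned in the abstract.
MODALITIES = ("image", "audio")


@dataclass(frozen=True)
class Task:
    """One translation task (hypothetical helper, not from the paper)."""

    target: str    # modality being translated; its content is preserved
    guidance: str  # modality of the style reference

    @property
    def cross_modal(self) -> bool:
        # Cross-modal when the style reference comes from the other modality.
        return self.target != self.guidance

    def describe(self) -> str:
        name = "image translation" if self.target == "image" else "voice conversion"
        return f"{name} with {self.guidance} guidance"


# Every (target, guidance) pairing: two intra-modality and two cross-modal tasks.
TASKS = [Task(target, guidance) for target, guidance in product(MODALITIES, MODALITIES)]

if __name__ == "__main__":
    for task in TASKS:
        kind = "cross-modal" if task.cross_modal else "intra-modality"
        print(f"{task.describe()} ({kind})")
```

Under this reading, "face that matches given voice" corresponds to image translation with audio guidance, and "voice that matches given face" to voice conversion with image guidance.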

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2302.13838/full.md

## Figures

10 figures with captions in the complete paper: https://tomesphere.com/paper/2302.13838/full.md

## References

86 references — full list in the complete paper: https://tomesphere.com/paper/2302.13838/full.md

---
Source: https://tomesphere.com/paper/2302.13838