Speech to Speech Synthesis for Voice Impersonation

Bjorn Johnson; Jared Levy

arXiv:2602.16721·cs.SD·February 20, 2026

Open Access

TL;DR

This paper introduces STSSN, a speech-to-speech synthesis model that combines speech recognition and speech synthesis techniques to perform voice impersonation. The authors report that it produces more convincing results than a generative adversarial model built for a similar task.

Contribution

The paper presents STSSN, a model for speech-to-speech style transfer that builds on current state-of-the-art speech recognition and speech synthesis systems to perform voice impersonation.

Findings

01

STSSN generates more convincing voice impersonation than the generative adversarial model it is compared against.

02

The model produces realistic audio samples despite a number of drawbacks in its capacity.

03

The benchmark is a comparison with a generative adversarial model that accomplishes a similar task.

Abstract

Numerous models have shown great success in speech recognition and in speech synthesis, but models for speech-to-speech processing have not been heavily explored. We propose the Speech to Speech Synthesis Network (STSSN), a model based on current state-of-the-art systems that fuses the two disciplines to perform effective speech-to-speech style transfer for the purpose of voice impersonation. We show that our proposed model is quite powerful and succeeds in generating realistic audio samples despite a number of drawbacks in its capacity. We benchmark our proposed model against a generative adversarial model that accomplishes a similar task, and show that ours produces more convincing results.

Taxonomy

Topics: Speech Recognition and Synthesis · Generative Adversarial Networks and Image Synthesis · Face recognition and analysis