Textless and Non-Parallel Speech-to-Speech Emotion Style Transfer

Soumya Dutta; Avni Jain; Sriram Ganapathy

arXiv:2505.17655 · eess.AS · March 11, 2026

TL;DR

This paper introduces S2S-ZEST, a zero-shot, textless speech-to-speech emotion style transfer framework. It transfers the emotional attributes of a reference recording to a source recording while preserving the source's content and speaker identity, and it outperforms prior methods in emotion transfer accuracy.

Contribution

The paper presents a zero-shot emotion style transfer framework built on an analysis-synthesis pipeline that requires neither parallel data nor text, and that improves style transfer performance over prior methods.

Findings

01  Enhanced emotion transfer accuracy compared to prior methods
02  Effective preservation of speaker identity and content
03  Applicable for data augmentation in emotion recognition

Abstract

Given a pair of source and reference speech recordings, speech-to-speech (S2S) emotion style transfer involves the generation of an output speech that mimics the emotion characteristics of the reference while preserving the content and speaker attributes of the source. In this paper, we propose a speech-to-speech zero-shot emotion style transfer framework, termed S2S Zero-shot Emotion Style Transfer (S2S-ZEST), that enables the transfer of emotional attributes from the reference to the source while retaining the speaker identity and speech content. The S2S-ZEST framework consists of an analysis-synthesis pipeline in which the analysis module extracts semantic tokens, speaker representations, and emotion embeddings from speech. Using these representations, a pitch contour estimator and a duration predictor are learned. Further, a synthesis module is designed to generate speech based on…
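
The division of roles between source and reference described in the abstract can be sketched as a small recombination step. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `AnalysisOutput`, `SynthesisInput`, and `recombine` are hypothetical, the representations are stand-in tuples rather than real model outputs, and the pitch contour estimator and duration predictor are left out because the abstract does not specify their inputs.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AnalysisOutput:
    """Hypothetical container for what the analysis module extracts from one recording."""
    semantic_tokens: Tuple[int, ...]       # content representation
    speaker_vector: Tuple[float, ...]      # speaker representation
    emotion_embedding: Tuple[float, ...]   # emotion representation


@dataclass(frozen=True)
class SynthesisInput:
    """Hypothetical bundle handed to the synthesis module."""
    semantic_tokens: Tuple[int, ...]
    speaker_vector: Tuple[float, ...]
    emotion_embedding: Tuple[float, ...]


def recombine(source: AnalysisOutput, reference: AnalysisOutput) -> SynthesisInput:
    """Keep content and speaker from the source; take emotion from the reference."""
    return SynthesisInput(
        semantic_tokens=source.semantic_tokens,
        speaker_vector=source.speaker_vector,
        emotion_embedding=reference.emotion_embedding,
    )


if __name__ == "__main__":
    # Stand-in values only; real features would come from trained encoders.
    src = AnalysisOutput((12, 7, 7, 31), (0.1, 0.9), (0.0, 1.0))
    ref = AnalysisOutput((5, 5, 2), (0.8, 0.2), (1.0, 0.0))
    print(recombine(src, ref))
```

Only the emotion embedding crosses over from the reference, so the reference's own content and speaker identity never reach the synthesis step in this sketch.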
