Textless and Non-Parallel Speech-to-Speech Emotion Style Transfer
Soumya Dutta, Avni Jain, Sriram Ganapathy

TL;DR
This paper introduces S2S-ZEST, a zero-shot, textless speech-to-speech emotion style transfer framework. It transfers the emotional attributes of a reference recording to a source recording while preserving the source's content and speaker identity, and the authors report that it outperforms prior methods.
Contribution
The paper presents a zero-shot emotion style transfer framework built on an analysis-synthesis pipeline. The framework requires neither parallel data nor text, and it improves style transfer performance over prior methods.
Findings
Enhanced emotion transfer accuracy compared to prior methods
Effective preservation of speaker identity and content
Applicable for data augmentation in emotion recognition
Abstract
Given a pair of source and reference speech recordings, speech-to-speech (S2S) emotion style transfer involves the generation of an output speech that mimics the emotion characteristics of the reference while preserving the content and speaker attributes of the source. In this paper, we propose a speech-to-speech zero-shot emotion style transfer framework, termed S2S Zero-shot Emotion Style Transfer (S2S-ZEST), that enables the transfer of emotional attributes from the reference to the source while retaining the speaker identity and speech content. The S2S-ZEST framework consists of an analysis-synthesis pipeline in which the analysis module extracts semantic tokens, speaker representations, and emotion embeddings from speech. Using these representations, a pitch contour estimator and a duration predictor are learned. Further, a synthesis module is designed to generate speech based on…
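The routing described in the abstract can be illustrated with a minimal sketch: content and speaker representations come from the source, while the emotion embedding comes from the reference. Everything named here (Analysis, SynthesisInputs, route_representations) is hypothetical and chosen for illustration only; it is not the authors' code, and it deliberately leaves out the pitch contour estimator, the duration predictor, and the synthesis module, whose details the truncated abstract does not give.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Analysis:
    """Hypothetical container for what the analysis module extracts from one recording."""
    semantic_tokens: List[int]   # content representation
    speaker: List[float]         # speaker representation
    emotion: List[float]         # emotion embedding


@dataclass
class SynthesisInputs:
    """Hypothetical bundle of representations handed on for generation."""
    semantic_tokens: List[int]   # taken from the source
    speaker: List[float]         # taken from the source
    emotion: List[float]         # taken from the reference


def route_representations(source: Analysis, reference: Analysis) -> SynthesisInputs:
    # Keep what is said and who says it from the source;
    # take only the emotional style from the reference.
    return SynthesisInputs(
        semantic_tokens=source.semantic_tokens,
        speaker=source.speaker,
        emotion=reference.emotion,
    )
```

In this sketch the reference's own content and speaker representations are simply discarded, which is what makes the transfer non-parallel: the two recordings do not need to share words or a speaker.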
