REWIND: Speech Time Reversal for Enhancing Speaker Representations in Diffusion-based Voice Conversion

Ishan D. Biyani; Nirmesh J. Shah; Ashishkumar P. Gudmalwar; Pankaj Wasnik; Rajiv R. Shah

arXiv:2505.20756·eess.AS·October 2, 2025


TL;DR

This paper introduces a data augmentation method that uses time-reversed speech to improve speaker representations in diffusion-based voice conversion, raising speaker similarity without sacrificing speech quality.

Contribution

The paper proposes using speaker representations learned from time-reversed speech as an augmentation strategy to improve speaker disentanglement in voice conversion models.

Findings

01. Significant improvement in speaker similarity scores.

02. Maintains high speech quality.

03. Effective augmentation strategy for diffusion-based VC.

Abstract

Speech time reversal refers to the process of reversing the entire speech signal in time, causing it to play backward. Such signals are completely unintelligible since the fundamental structures of phonemes and syllables are destroyed. However, they still retain tonal patterns that enable perceptual speaker identification despite losing linguistic content. In this paper, we propose leveraging speaker representations learned from time-reversed speech as an augmentation strategy to enhance speaker representation. Notably, speaker and language disentanglement in voice conversion (VC) is essential to accurately preserve a speaker's unique vocal traits while minimizing interference from linguistic content. The effectiveness of the proposed approach is evaluated in the context of state-of-the-art diffusion-based VC models. Experimental results indicate that the proposed approach significantly…
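
The time-reversal operation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `time_reverse` and `augment_with_reversal` are hypothetical, the waveform is synthetic noise standing in for speech, and how the reversed signals are fed to the speaker encoder in the actual system is not stated in this summary. The sketch only shows reversing a waveform along the time axis, pairing each reversed copy with its original speaker label, and checking that reversal leaves the long-term magnitude spectrum of a real signal unchanged.

```python
import numpy as np


def time_reverse(waveform: np.ndarray) -> np.ndarray:
    """Return the waveform played backward along its last (time) axis."""
    return np.flip(waveform, axis=-1).copy()


def augment_with_reversal(utterances, speaker_ids):
    """Append a time-reversed copy of each utterance, keeping its speaker label.

    Hypothetical helper: it assumes reversed speech is simply added to the
    training pool under the same speaker identity.
    """
    aug_wavs = list(utterances) + [time_reverse(w) for w in utterances]
    aug_ids = list(speaker_ids) + list(speaker_ids)
    return aug_wavs, aug_ids


# Synthetic stand-in for one second of audio at 16 kHz (not real speech).
rng = np.random.default_rng(0)
wav = rng.standard_normal(16000)
rev = time_reverse(wav)

# For a real-valued signal, reversal only changes the phase of the DFT,
# so the magnitude spectrum of the whole signal is preserved.
same_magnitude = np.allclose(np.abs(np.fft.rfft(wav)), np.abs(np.fft.rfft(rev)))

# Two utterances from two (made-up) speakers become four training items.
wavs, ids = augment_with_reversal([wav, wav[:8000]], ["spk_a", "spk_b"])
```

Reversing twice recovers the original signal, so the augmentation loses no information; it reorders the samples in time while the overall spectral magnitude stays the same.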
