Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation

Nameer Hirschkind; Xiao Yu; Mahesh Kumar Nandwana; Joseph Liu; Eloi DuBois; Dao Le; Nicolas Thiebaut; Colin Sinclair; Kyle Spence; Charles Shang; Zoe Abrams; Morgan McGuire

arXiv:2406.10223·cs.LG·June 17, 2024

TL;DR

This paper presents DiffuseST, a low-latency multilingual speech translation system that uses a novel diffusion-based synthesizer to improve audio quality and speaker similarity while maintaining fast, real-time performance.

Contribution

The paper introduces a diffusion-based synthesizer for direct speech-to-speech translation. Compared with a Tacotron-based synthesizer, it improves audio quality and speaker similarity at lower latency, while keeping BLEU scores comparable.

Findings

01. The diffusion synthesizer improves MOS and PESQ by 23% each over the Tacotron-based synthesizer

02. Speaker similarity increases by 5% with the diffusion synthesizer

03. The entire model runs more than 5 times faster than real-time

Abstract

We introduce DiffuseST, a low-latency, direct speech-to-speech translation system capable of preserving the input speaker's voice zero-shot while translating from multiple source languages into English. We experiment with the synthesizer component of the architecture, comparing a Tacotron-based synthesizer to a novel diffusion-based synthesizer. We find the diffusion-based synthesizer to improve MOS and PESQ audio quality metrics by 23% each and speaker similarity by 5% while maintaining comparable BLEU scores. Despite having more than double the parameter count, the diffusion synthesizer has lower latency, allowing the entire model to run more than 5× faster than real-time.
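
As a rough illustration of what "faster than real-time" usually means, the sketch below computes the real-time factor (processing time divided by audio duration) and its inverse, the speedup. The function names and the timings are hypothetical and are not taken from the paper; the abstract does not state how the authors measured latency, so this is only one common reading of the claim.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; values below 1.0 are faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def speedup_over_real_time(processing_seconds: float, audio_seconds: float) -> float:
    """Inverse of the real-time factor: seconds of audio handled per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# Hypothetical timings (assumed for illustration, not reported in the paper):
# 10 seconds of input speech translated in 2 seconds of compute.
rtf = real_time_factor(processing_seconds=2.0, audio_seconds=10.0)
speedup = speedup_over_real_time(processing_seconds=2.0, audio_seconds=10.0)
print(f"real-time factor: {rtf:.2f}, speedup: {speedup:.1f}x")
```

Under this reading, "more than 5× faster than real-time" corresponds to a speedup above 5, or equivalently a real-time factor below 0.2.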

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques

Methods: Diffusion