Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation
Nameer Hirschkind, Xiao Yu, Mahesh Kumar Nandwana, Joseph Liu, Eloi DuBois, Dao Le, Nicolas Thiebaut, Colin Sinclair, Kyle Spence, Charles Shang, Zoe Abrams, Morgan McGuire

TL;DR
This paper presents DiffuseST, a low-latency multilingual speech translation system that uses a novel diffusion-based synthesizer to improve audio quality and speaker similarity while maintaining fast, real-time performance.
Contribution
The paper introduces a diffusion-based synthesizer for speech-to-speech translation that outperforms a Tacotron-based synthesizer in audio quality, speaker similarity, and latency, enabling faster, more natural-sounding multilingual translation into English.
Findings
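Diffusion synthesizer improves the MOS and PESQ audio quality metrics by 23% each over the Tacotron-based synthesizer
Speaker similarity increases by 5% with the diffusion synthesizer, while BLEU scores remain comparable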
System runs over 5 times faster than real-time
Abstract
We introduce DiffuseST, a low-latency, direct speech-to-speech translation system capable of preserving the input speaker's voice zero-shot while translating from multiple source languages into English. We experiment with the synthesizer component of the architecture, comparing a Tacotron-based synthesizer to a novel diffusion-based synthesizer. We find that the diffusion-based synthesizer improves the MOS and PESQ audio quality metrics by 23% each and speaker similarity by 5% while maintaining comparable BLEU scores. Despite having more than double the parameter count, the diffusion synthesizer has lower latency, allowing the entire model to run more than 5× faster than real-time.
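To make the "faster than real-time" and percentage claims concrete, here is a minimal sketch of how such figures are commonly computed. The function names and example numbers are hypothetical and are not taken from the paper, which does not state its exact measurement procedure in the abstract. The sketch assumes speed is audio duration divided by processing time, and that improvements are relative to the baseline score on a metric where higher is better.

```python
def speedup_over_real_time(audio_seconds: float, processing_seconds: float) -> float:
    """Return how many times faster than real time the audio was processed.

    A value above 1.0 means the system finishes before the audio would
    have finished playing.
    """
    if audio_seconds <= 0 or processing_seconds <= 0:
        raise ValueError("durations must be positive")
    return audio_seconds / processing_seconds


def relative_improvement_percent(baseline: float, candidate: float) -> float:
    """Return the percentage gain of candidate over baseline.

    Assumes a metric where higher is better (for example MOS or PESQ).
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline * 100.0


# Hypothetical timing, NOT from the paper: 10 s of audio processed in 2 s.
example_speedup = speedup_over_real_time(audio_seconds=10.0, processing_seconds=2.0)
print(example_speedup)  # 5.0
```

Under these assumptions, a system that runs "more than 5× faster than real-time" would process 10 seconds of input audio in under 2 seconds.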
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques
Methods: Diffusion
