Language translation, and change of accent for speech-to-speech task using diffusion model

Abhishek Mishra; Ritesh Sur Chowdhury; Vartul Bahuguna; Isha Pandey; Ganesh Ramakrishnan

arXiv:2505.04639 · cs.CL · May 9, 2025

Open Access

TL;DR

This paper introduces a diffusion model-based method for simultaneous speech translation and accent adaptation, enabling high-fidelity, integrated cross-lingual and accentual speech conversion.

Contribution

It presents a novel unified diffusion-based framework that jointly handles translation and accent change in speech-to-speech tasks, improving efficiency and effectiveness.

Findings

1. Achieves high-quality speech translation with accent adaptation.

2. Outperforms traditional pipelines in parameter efficiency.

3. Demonstrates effective joint optimization of translation and accent change.

Abstract

Speech-to-speech translation (S2ST) aims to convert spoken input in one language to spoken output in another, typically focusing on either language translation or accent adaptation. However, effective cross-cultural communication requires handling both aspects simultaneously: translating content while adapting the speaker's accent to match the target language context. In this work, we propose a unified approach for simultaneous speech translation and change of accent, a task that remains underexplored in current literature. Our method reformulates the problem as a conditional generation task, where target speech is generated based on phonemes and guided by target speech features. Leveraging the power of diffusion models, known for high-fidelity generative capabilities, we adapt text-to-image diffusion strategies by conditioning on source speech transcriptions and generating Mel…
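The conditional-generation framing in the abstract can be illustrated with a minimal sketch of a standard DDPM-style noise-prediction objective applied to a Mel-spectrogram. This is not the authors' implementation: the linear noise schedule, the function names (`make_alpha_bar`, `add_noise`, `denoising_loss`), the placeholder denoiser, and the random stand-in data are all assumptions made for illustration, and the paper's actual architecture and conditioning mechanism may differ.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear noise schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(mel, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0), where x_0 is the clean Mel-spectrogram."""
    noise = rng.standard_normal(mel.shape)
    noisy = np.sqrt(alpha_bar[t]) * mel + np.sqrt(1.0 - alpha_bar[t]) * noise
    return noisy, noise

def denoising_loss(predict_noise, mel, cond, t, alpha_bar, rng):
    """Mean squared error between the true noise and the model's prediction.

    `predict_noise(noisy_mel, t, cond)` is a hypothetical conditional denoiser;
    `cond` stands in for phoneme or transcription features.
    """
    noisy, noise = add_noise(mel, t, alpha_bar, rng)
    predicted = predict_noise(noisy, t, cond)
    return float(np.mean((predicted - noise) ** 2))

def zero_denoiser(noisy_mel, t, cond):
    """Placeholder for a trained network: always predicts zero noise."""
    return np.zeros_like(noisy_mel)

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
mel = rng.standard_normal((80, 120))  # placeholder: 80 Mel bins x 120 frames
cond = np.array([3, 17, 42])          # placeholder phoneme IDs
loss = denoising_loss(zero_denoiser, mel, cond, t=500, alpha_bar=alpha_bar, rng=rng)
```

In a real system, `zero_denoiser` would be replaced by a trained network that actually uses `cond`, and generation would run the learned reverse process from pure noise to a Mel-spectrogram that a vocoder then converts to audio.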

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Voice and Speech Disorders

Methods: Diffusion