Timbre-Aware LLM-based Direct Speech-to-Speech Translation Extendable to Multiple Language Pairs

Lalaram Arya, Mrinmoy Bhattacharjee, Adarsh C. R., and S. R. Mahadeva Prasanna

arXiv:2601.16023·eess.AS·January 23, 2026

Open Access

TL;DR

This paper introduces DS2ST-LM, a scalable, multilingual direct speech-to-speech translation framework that leverages a large language model and timbre control to improve translation quality, stability, and speaker identity preservation across multiple languages.

Contribution

The work presents a novel, scalable direct S2ST system integrating a multilingual LLM, synthetic data augmentation, and timbre-aware synthesis, advancing multilingual capabilities and speaker preservation.

Findings

01. Outperforms traditional cascaded and baseline systems in BLEU, METEOR, BLEURT, and COMET metrics.

02. Effectively extends to multiple languages including French, Spanish, German, Hindi, Bengali, and Urdu.

03. Timbre-aware synthesis improves speaker similarity and naturalness.

Abstract

Direct Speech-to-Speech Translation (S2ST) has gained increasing attention for its ability to translate speech from one language to another, while reducing error propagation and latency inherent in traditional cascaded pipelines. However, existing direct S2ST systems continue to face notable challenges, including instability in semantic-acoustic alignment when parallel speech data is scarce, difficulty in preserving speaker identity, and limited multilingual scalability. In this work, we introduce DS2ST-LM, a scalable, single-stage direct S2ST framework leveraging a multilingual Large Language Model (LLM). The architecture integrates a Whisper speech encoder, a learnable projection module, a Qwen2-0.5B LLM, and a timbre-controlled vocoder. We construct GigaS2S-1000, a 1000-hour bilingual corpus by extending the GigaST dataset with high-fidelity synthetic target speech, and show that…
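The abstract names a learnable projection module between the speech encoder and the LLM but does not describe its internals here. As a rough illustration only, the sketch below assumes one common adapter design: stack every k consecutive encoder frames to shorten the sequence, then apply a single linear layer that maps the stacked vector to the LLM embedding width. The names `stack_frames`, `make_linear`, and `project`, the stacking factor, and the toy dimensions are all hypothetical and are not taken from the paper.

```python
import random


def stack_frames(frames, k):
    """Concatenate every k consecutive frames, dropping a ragged tail."""
    usable = len(frames) - len(frames) % k
    return [sum(frames[i:i + k], []) for i in range(0, usable, k)]


def make_linear(in_dim, out_dim, rng):
    """Randomly initialised weights standing in for learned parameters."""
    scale = in_dim ** -0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def linear(x, weight, bias):
    """Plain matrix-vector product plus bias."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def project(frames, k, layer):
    """Downsample by frame stacking, then map to the LLM embedding width."""
    weight, bias = layer
    return [linear(x, weight, bias) for x in stack_frames(frames, k)]


# Toy sizes chosen for readability (assumptions, not the paper's settings).
ENC_DIM, LLM_DIM, K = 8, 6, 4
rng = random.Random(0)

# Stand-in for speech-encoder output: 10 frames of ENC_DIM features each.
encoder_frames = [[rng.gauss(0.0, 1.0) for _ in range(ENC_DIM)]
                  for _ in range(10)]

layer = make_linear(ENC_DIM * K, LLM_DIM, rng)
llm_inputs = project(encoder_frames, K, layer)
print(len(llm_inputs), len(llm_inputs[0]))
```

With these toy settings, ten encoder frames become two vectors of width six. In a trained system the weights would be learned jointly with the rest of the model, and the projected vectors would be fed to the LLM in place of text token embeddings.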

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research