Improving Direct Persian-English Speech-to-Speech Translation with Discrete Units and Synthetic Parallel Data

Sina Rashidi; Hossein Sameti

arXiv:2511.12690 · cs.CL · November 18, 2025

Open Access

TL;DR

This paper introduces a direct Persian-English speech-to-speech translation system that leverages synthetic parallel data, discrete speech units, and self-supervised pre-training to improve translation quality in low-resource scenarios.

Contribution

The paper presents a pipeline that combines discrete speech units with synthetic data generation to improve direct S2ST for low-resource language pairs such as Persian-English.

Findings

01. Achieved a 4.6 BLEU improvement over baselines.

02. Constructed a sixfold larger parallel speech corpus.

03. Demonstrated the effectiveness of synthetic data in low-resource S2ST.

Abstract

Direct speech-to-speech translation (S2ST), in which all components are trained jointly, is an attractive alternative to cascaded systems because it offers a simpler pipeline and lower inference latency. However, direct S2ST models require large amounts of parallel speech data in the source and target languages, which are rarely available for low-resource languages such as Persian. This paper presents a direct S2ST system for translating Persian speech into English speech, as well as a pipeline for synthetic parallel Persian-English speech generation. The model comprises three components: (1) a conformer-based encoder, initialized from self-supervised pre-training, maps source speech to high-level acoustic representations; (2) a causal transformer decoder with relative position multi-head attention translates these representations into discrete target speech units; (3) a unit-based…

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling