Improving Direct Persian-English Speech-to-Speech Translation with Discrete Units and Synthetic Parallel Data
Sina Rashidi, Hossein Sameti

TL;DR
This paper introduces a direct Persian-English speech-to-speech translation system that leverages synthetic parallel data, discrete speech units, and self-supervised pre-training to improve translation quality in low-resource scenarios.
Contribution
It presents a novel pipeline combining discrete units and synthetic data generation to enhance direct S2ST for low-resource languages like Persian-English.
Findings
Achieved a 4.6 BLEU improvement over baselines.
Constructed a sixfold larger parallel speech corpus.
Demonstrated effectiveness of synthetic data in low-resource S2ST.
Abstract
Direct speech-to-speech translation (S2ST), in which all components are trained jointly, is an attractive alternative to cascaded systems because it offers a simpler pipeline and lower inference latency. However, direct S2ST models require large amounts of parallel speech data in the source and target languages, which are rarely available for low-resource languages such as Persian. This paper presents a direct S2ST system for translating Persian speech into English speech, as well as a pipeline for synthetic parallel Persian-English speech generation. The model comprises three components: (1) a conformer-based encoder, initialized from self-supervised pre-training, maps source speech to high-level acoustic representations; (2) a causal transformer decoder with relative position multi-head attention translates these representations into discrete target speech units; (3) a unit-based…
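As a rough illustration of what "discrete target speech units" can look like in practice, the sketch below quantizes continuous feature frames by nearest-centroid assignment and then collapses consecutive repeats. The excerpt above does not say how the paper derives its units, so this is an assumption based on a common recipe in unit-based S2ST (k-means clustering over self-supervised features). The function names `quantize` and `collapse_repeats`, the toy 2-D frames, and the three centroids are all hypothetical and chosen only for illustration.

```python
from itertools import groupby


def quantize(frames, centroids):
    """Map each feature frame to the index of its nearest centroid.

    Distance is squared Euclidean. In a real system the centroids would
    come from clustering learned speech features (an assumption here,
    not something the excerpt confirms).
    """
    units = []
    for frame in frames:
        dists = [
            sum((f - c) ** 2 for f, c in zip(frame, centroid))
            for centroid in centroids
        ]
        units.append(dists.index(min(dists)))
    return units


def collapse_repeats(units):
    """Merge runs of identical consecutive units into a single unit."""
    return [unit for unit, _ in groupby(units)]


# Toy 2-D "features"; real speech features have hundreds of dimensions.
centroids = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
frames = [
    (0.1, 0.0), (0.0, 0.1),   # both closest to centroid 0
    (0.9, 1.1), (1.0, 0.9),   # both closest to centroid 1
    (0.1, 0.9),               # closest to centroid 2
    (0.0, 0.2),               # closest to centroid 0
]

units = quantize(frames, centroids)
reduced = collapse_repeats(units)
print(units)    # [0, 0, 1, 1, 2, 0]
print(reduced)  # [0, 1, 2, 0]
```

A decoder of the kind described in component (2) would predict a sequence like `reduced` one unit at a time, and a unit-based vocoder would turn such a sequence back into a waveform. Whether the paper collapses repeated units is not stated in the excerpt.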
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
