Phonology-Guided Speech-to-Speech Translation for African Languages
Peter Ochieng, Dennis Kaburu

TL;DR
This paper introduces a prosody-guided speech-to-speech translation framework for African languages that leverages pause synchrony for alignment and uses diffusion models guided by semantic and speaker cues, achieving high-quality translation without transcripts.
Contribution
It presents SPaDA, a novel alignment algorithm utilizing prosodic cues, and SegUniDiff, a diffusion-based translation model guided by external semantic and speaker gradients, advancing non-autoregressive S2ST.
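To make "guided by external semantic and speaker gradients" concrete, here is a minimal sketch of one classifier-guidance-style update, in which the mean of a reverse diffusion step is shifted by the weighted gradients of frozen scoring functions. This is an assumption about the general form of such guidance, not SegUniDiff's actual update rule. The names `guided_mean` and `quadratic_grad`, the toy quadratic scores, and the weights are all hypothetical.

```python
# Toy sketch of gradient-guided diffusion sampling (classifier-guidance style).
# Assumption: the guided mean is model_mean + sigma^2 * sum_k w_k * grad_k(x).
# The "encoders" below are stand-in quadratic scores, not real frozen networks.

def quadratic_grad(target):
    """Gradient of the toy score -0.5 * ||x - target||^2 with respect to x."""
    return lambda x: [t - xi for t, xi in zip(target, x)]

def guided_mean(x, model_mean, sigma, grad_fns, weights):
    """Shift the denoiser's predicted mean along the weighted external gradients."""
    total = [0.0] * len(x)
    for grad_fn, w in zip(grad_fns, weights):
        g = grad_fn(x)
        total = [t + w * gi for t, gi in zip(total, g)]
    return [m + sigma ** 2 * t for m, t in zip(model_mean, total)]

# Hypothetical usage: a semantic score pulling toward 1.0 and a speaker score
# pulling toward 2.0, with the speaker term weighted half as strongly.
x = [0.0] * 4
model_mean = [0.0] * 4
semantic = quadratic_grad([1.0] * 4)
speaker = quadratic_grad([2.0] * 4)
shifted = guided_mean(x, model_mean, 0.5, [semantic, speaker], [1.0, 0.5])
```

Because the scoring functions only supply gradients at sampling time, they can stay frozen while the diffusion model itself is trained without them, which is the appeal of external guidance.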
Findings
SPaDA improves alignment F1 by 3-4 points and eliminates up to 38% of spurious matches relative to greedy VAD baselines.
SegUniDiff achieves BLEU of 30.3, surpassing cascade models, and reduces speaker error rate to 5.3%.
The BLEU suite correlates strongly with human judgments in low-resource settings.
Abstract
We present a prosody-guided framework for speech-to-speech translation (S2ST) that aligns and translates speech without transcripts by leveraging cross-linguistic pause synchrony. Analyzing a 6,000-hour East African news corpus spanning five languages, we show that within-phylum language pairs exhibit 30-40% lower pause variance and over 3× higher onset/offset correlation compared to cross-phylum pairs. These findings motivate SPaDA, a dynamic-programming alignment algorithm that integrates silence consistency, rate synchrony, and semantic similarity. SPaDA improves alignment by 3-4 points and eliminates up to 38% of spurious matches relative to greedy VAD baselines. Using SPaDA-aligned segments, we train SegUniDiff, a diffusion-based S2ST model guided by external gradients from frozen semantic and speaker encoders. SegUniDiff…
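The alignment idea in the abstract can be sketched as a small dynamic program over two segment sequences. This is an illustrative sketch only, not the published SPaDA algorithm: the `Segment` fields, the three scoring terms, their weights, and the skip penalty are all assumptions chosen to mirror the "silence consistency, rate synchrony, and semantic similarity" wording.

```python
# Illustrative DP aligner combining pause, rate, and semantic terms.
# Assumptions: the scoring form, weights, and skip_penalty are hypothetical;
# durations and syllable counts must be positive.
import math
from dataclasses import dataclass

@dataclass
class Segment:
    pause_before: float  # seconds of silence preceding the segment
    duration: float      # segment length in seconds
    n_syllables: int     # stand-in for a speaking-rate estimate
    embedding: tuple     # stand-in for a semantic vector

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def pair_score(s, t, w_sil=1.0, w_rate=1.0, w_sem=1.0):
    silence = -abs(s.pause_before - t.pause_before)           # silence consistency
    rate = -abs(math.log((s.n_syllables / s.duration) /
                         (t.n_syllables / t.duration)))       # rate synchrony
    semantic = cosine(s.embedding, t.embedding)                # semantic similarity
    return w_sil * silence + w_rate * rate + w_sem * semantic

def align(src, tgt, skip_penalty=-0.5):
    """Return monotonic (src_index, tgt_index) pairs maximizing the total score."""
    n, m = len(src), len(tgt)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0], back[i][0] = i * skip_penalty, "skip_src"
    for j in range(1, m + 1):
        best[0][j], back[0][j] = j * skip_penalty, "skip_tgt"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (best[i - 1][j - 1] + pair_score(src[i - 1], tgt[j - 1]), "match"),
                (best[i - 1][j] + skip_penalty, "skip_src"),
                (best[i][j - 1] + skip_penalty, "skip_tgt"),
            ]
            best[i][j], back[i][j] = max(options, key=lambda o: o[0])
    pairs, i, j = [], n, m
    while i > 0 or j > 0:  # trace back from the final cell
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_src":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

# Hypothetical data: the target side contains one short spurious segment.
src = [Segment(0.2, 2.0, 8, (1, 0)), Segment(0.6, 3.0, 12, (0, 1)),
       Segment(0.3, 1.5, 6, (1, 1))]
tgt = [Segment(0.25, 2.2, 9, (0.9, 0.1)), Segment(0.05, 0.4, 4, (-1, -1)),
       Segment(0.55, 2.8, 11, (0.1, 0.9)), Segment(0.3, 1.6, 6, (1, 0.9))]
pairs = align(src, tgt)
```

On this toy input the aligner skips the short spurious target segment rather than forcing a match. Rejecting such segments through a global score, instead of pairing them greedily, is the kind of behavior the abstract credits for the reduction in spurious matches.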
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
