Phonology-Guided Speech-to-Speech Translation for African Languages

Peter Ochieng; Dennis Kaburu

arXiv:2410.23323·eess.AS·June 12, 2025

TL;DR

This paper introduces a prosody-guided speech-to-speech translation framework for African languages that leverages pause synchrony for alignment and uses diffusion models guided by semantic and speaker cues, achieving high-quality translation without transcripts.

Contribution

It presents SPaDA, a novel alignment algorithm utilizing prosodic cues, and SegUniDiff, a diffusion-based translation model guided by external semantic and speaker gradients, advancing non-autoregressive S2ST.

Findings

01

SPaDA improves alignment F1 by 3-4 points and eliminates up to 38% of spurious matches relative to greedy VAD baselines.

02

SegUniDiff achieves BLEU of 30.3, surpassing cascade models, and reduces speaker error rate to 5.3%.

03

The BLEU suite correlates strongly with human judgments in low-resource settings.

Abstract

We present a prosody-guided framework for speech-to-speech translation (S2ST) that aligns and translates speech without transcripts by leveraging cross-linguistic pause synchrony. Analyzing a 6,000-hour East African news corpus spanning five languages, we show that within-phylum language pairs exhibit 30-40% lower pause variance and over 3× higher onset/offset correlation compared to cross-phylum pairs. These findings motivate SPaDA, a dynamic-programming alignment algorithm that integrates silence consistency, rate synchrony, and semantic similarity. SPaDA improves alignment F1 by 3-4 points and eliminates up to 38% of spurious matches relative to greedy VAD baselines. Using SPaDA-aligned segments, we train SegUniDiff, a diffusion-based S2ST model guided by external gradients from frozen semantic and speaker encoders. SegUniDiff…
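
To make the idea of a dynamic-programming alignment over silence, rate, and semantic cues more concrete, here is a minimal sketch in Python. It is not the authors' SPaDA implementation: the `Segment` fields, the `closeness` measure, the weights, and the skip penalty are all hypothetical choices made for illustration, and the embeddings stand in for whatever a semantic encoder might produce.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Segment:
    pause_before: float     # seconds of silence preceding the segment (assumed feature)
    duration: float         # seconds of speech, used as a crude rate proxy (assumed)
    embedding: list[float]  # hypothetical semantic embedding of the segment


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def closeness(x: float, y: float) -> float:
    """1.0 when the two values are equal, falling toward 0.0 as they diverge."""
    m = max(x, y)
    return 1.0 - abs(x - y) / m if m > 0 else 1.0


def pair_score(s: Segment, t: Segment, w=(0.3, 0.3, 0.4)) -> float:
    """Weighted sum of the three cues; the weights are illustrative only."""
    return (w[0] * closeness(s.pause_before, t.pause_before)
            + w[1] * closeness(s.duration, t.duration)
            + w[2] * cosine(s.embedding, t.embedding))


def align(src: list[Segment], tgt: list[Segment], skip_penalty: float = 0.2):
    """Monotone DP alignment; returns (src_index, tgt_index) pairs in order."""
    n, m = len(src), len(tgt)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0], back[i][0] = best[i - 1][0] - skip_penalty, "skip_src"
    for j in range(1, m + 1):
        best[0][j], back[0][j] = best[0][j - 1] - skip_penalty, "skip_tgt"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (best[i - 1][j - 1] + pair_score(src[i - 1], tgt[j - 1]), "match"),
                (best[i - 1][j] - skip_penalty, "skip_src"),
                (best[i][j - 1] - skip_penalty, "skip_tgt"),
            ]
            best[i][j], back[i][j] = max(options)
    # Trace back from the bottom-right cell to recover the matched pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_src":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


# Toy example: the target has one extra segment (index 1) with no source counterpart.
src = [Segment(0.5, 2.0, [1, 0, 0, 0]),
       Segment(0.3, 3.0, [0, 1, 0, 0]),
       Segment(0.8, 1.5, [0, 0, 1, 0])]
tgt = [Segment(0.5, 2.1, [1, 0, 0, 0]),
       Segment(0.1, 0.4, [0, 0, 0, 1]),
       Segment(0.3, 2.9, [0, 1, 0, 0]),
       Segment(0.7, 1.6, [0, 0, 1, 0])]
pairs = align(src, tgt)
print(pairs)  # [(0, 0), (1, 2), (2, 3)]
```

In this toy case the unmatched target segment is skipped rather than forced onto a source segment, which is the kind of behavior a greedy matcher can miss. A real system would probably also reject low-scoring matches with a threshold, which this sketch omits.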

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems