Direct Simultaneous Speech-to-Speech Translation with Variational Monotonic Multihead Attention
Xutai Ma, Hongyu Gong, Danni Liu, Ann Lee, Yun Tang, Peng-Jen Chen, Wei-Ning Hsu, Philipp Koehn, Juan Pino

TL;DR
This paper introduces a direct simultaneous speech-to-speech translation model that predicts discrete units and uses a novel variational monotonic multihead attention (V-MMA) mechanism, improving the tradeoff between translation quality and latency.
Contribution
The paper proposes a new direct Simul-S2ST model with V-MMA, enabling efficient policy learning and eliminating the need for intermediate text representations.
Findings
Direct model outperforms cascaded approach in quality-latency tradeoff.
Uses discrete units learned in an unsupervised manner for speech synthesis.
V-MMA improves efficiency in simultaneous translation policy learning.
Abstract
We present a direct simultaneous speech-to-speech translation (Simul-S2ST) model. Furthermore, the generation of translation is independent from intermediate text representations. Our approach leverages recent progress on direct speech-to-speech translation with discrete units, in which the model predicts a sequence of discrete representations learned in an unsupervised manner, instead of continuous spectrogram features, and passes them directly to a vocoder for on-the-fly speech synthesis. We also introduce the variational monotonic multihead attention (V-MMA) to handle the challenge of inefficient policy learning in simultaneous speech translation. The simultaneous policy then operates on source speech features and target discrete units. We carry out empirical studies to compare cascaded and direct approaches on the Fisher Spanish-English and MuST-C English-Spanish datasets.…
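The read/write loop that a simultaneous policy drives over source speech features and target discrete units can be sketched as follows. This is a minimal illustration only: it substitutes a simple wait-k rule for the learned V-MMA policy, and every name in it (`simultaneous_decode`, `wait_k_policy`, `toy_predict_unit`, the EOS unit value) is hypothetical rather than taken from the paper or its code.

```python
from typing import Callable, List, Sequence, Tuple

READ, WRITE = "READ", "WRITE"


def wait_k_policy(k: int) -> Callable[[int, int], str]:
    """Stand-in policy (NOT V-MMA): stay k source frames ahead of the output."""
    def policy(frames_read: int, units_written: int) -> str:
        return READ if frames_read < k + units_written else WRITE
    return policy


def simultaneous_decode(
    source_frames: Sequence[float],
    policy: Callable[[int, int], str],
    predict_unit: Callable[[Sequence[float], List[int]], int],
    eos_unit: int = 0,      # assumed end-of-sequence unit id
    max_units: int = 50,    # safety cap on output length
) -> List[Tuple[int, int]]:
    """Return (unit, frames_read_at_emission) pairs for each emitted unit."""
    frames_read = 0
    units: List[int] = []
    trace: List[Tuple[int, int]] = []
    while len(units) < max_units:
        source_finished = frames_read == len(source_frames)
        # Once the source is exhausted, the only possible action is WRITE.
        action = WRITE if source_finished else policy(frames_read, len(units))
        if action == READ:
            frames_read += 1
            continue
        unit = predict_unit(source_frames[:frames_read], units)
        if unit == eos_unit:
            break
        units.append(unit)
        # In a real system the unit would be handed to a vocoder here.
        trace.append((unit, frames_read))
    return trace


def toy_predict_unit(prefix: Sequence[float], units: List[int]) -> int:
    """Toy model: emit units 1..4, then the assumed EOS unit 0."""
    return len(units) + 1 if len(units) < 4 else 0


trace = simultaneous_decode([0.1, 0.2, 0.3, 0.4, 0.5], wait_k_policy(2), toy_predict_unit)
```

Each entry in `trace` records how much source audio had been consumed when a unit was emitted, which is the quantity latency metrics are built on. A learned policy such as V-MMA would replace `wait_k_policy` with a decision computed from the model's attention.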
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
