A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation
Zhengrui Ma, Qingkai Fang, Shaolei Zhang, Shoutao Guo, Yang Feng, Min Zhang

TL;DR
This paper introduces NAST-S2X, a non-autoregressive end-to-end framework for simultaneous speech translation that reduces delay and increases decoding speed by integrating speech-to-text and speech-to-speech tasks.
Contribution
It presents a novel non-autoregressive decoder that enables concurrent generation of tokens and dynamic latency adjustment, improving over existing pipeline methods.
Findings
Outperforms state-of-the-art models in speech-to-text and speech-to-speech tasks.
Achieves a delay of less than 3 seconds in simultaneous interpretation.
Speeds up decoding 28-fold in offline generation.
Abstract
Simultaneous translation models play a crucial role in facilitating communication. However, existing research primarily focuses on text-to-text or speech-to-text models, necessitating additional cascade components to achieve speech-to-speech translation. These pipeline methods suffer from error propagation and accumulate delays in each cascade component, resulting in reduced synchronization between the speaker and listener. To overcome these challenges, we propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X), which integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework. We develop a non-autoregressive decoder capable of concurrently generating multiple text or acoustic unit tokens upon receiving fixed-length speech chunks. The decoder can generate blank or repeated tokens and employ CTC decoding to…
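The abstract mentions a decoder that emits blank or repeated tokens and then applies CTC decoding. As a rough illustration, the standard CTC collapse rule (merge consecutive repeats, then drop blanks) can be sketched as follows. This is a minimal sketch of the generic rule, not the authors' implementation; the blank id `0`, the function names, and the chunked input layout are all assumptions made for the example.

```python
from itertools import groupby
from typing import Iterable, List, Optional, Sequence

BLANK = 0  # hypothetical blank token id, chosen only for this example


def ctc_collapse(tokens: Sequence[int], blank: int = BLANK) -> List[int]:
    """Standard CTC collapse: merge consecutive repeats, then drop blanks."""
    return [tok for tok, _ in groupby(tokens) if tok != blank]


def collapse_chunks(chunks: Iterable[Sequence[int]], blank: int = BLANK) -> List[int]:
    """Collapse per-chunk decoder outputs as they arrive.

    The previous raw token is carried across chunk boundaries, so a repeat
    that straddles two chunks is still merged into a single output token.
    """
    out: List[int] = []
    prev: Optional[int] = None
    for chunk in chunks:
        for tok in chunk:
            # Emit only when the token differs from the previous raw token
            # and is not the blank symbol.
            if tok != prev and tok != blank:
                out.append(tok)
            prev = tok
    return out


if __name__ == "__main__":
    raw = [0, 5, 5, 0, 5, 7, 7, 0]
    print(ctc_collapse(raw))
    print(collapse_chunks([[0, 5, 5], [5, 0, 5], [7, 7, 0]]))
```

A blank between two identical tokens keeps them distinct, which is why the first `5 5` run and the later `5` survive as two separate tokens. The chunked variant gives the same result as collapsing the concatenated sequence.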
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
