A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation

Zhengrui Ma; Qingkai Fang; Shaolei Zhang; Shoutao Guo; Yang Feng; Min Zhang

arXiv:2406.06937 · cs.CL · October 22, 2024

TL;DR

This paper introduces NAST-S2X, a non-autoregressive end-to-end framework for simultaneous speech translation that reduces delay and increases decoding speed by integrating speech-to-text and speech-to-speech tasks.

Contribution

The paper presents a non-autoregressive decoder that generates multiple text or acoustic unit tokens concurrently for each fixed-length speech chunk and supports dynamic latency adjustment, improving over existing cascaded pipeline methods.

Findings

01. Outperforms state-of-the-art models in speech-to-text and speech-to-speech tasks.

02. Achieves a delay of less than 3 seconds in simultaneous interpretation.

03. Provides 28 times faster decoding in offline generation.

Abstract

Simultaneous translation models play a crucial role in facilitating communication. However, existing research primarily focuses on text-to-text or speech-to-text models, necessitating additional cascade components to achieve speech-to-speech translation. These pipeline methods suffer from error propagation and accumulate delays in each cascade component, resulting in reduced synchronization between the speaker and listener. To overcome these challenges, we propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X), which integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework. We develop a non-autoregressive decoder capable of concurrently generating multiple text or acoustic unit tokens upon receiving fixed-length speech chunks. The decoder can generate blank or repeated tokens and employ CTC decoding to…
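The abstract mentions a decoder that emits blank or repeated tokens per speech chunk and relies on CTC decoding. As a rough illustration of that general idea, here is a minimal sketch of the standard CTC collapse rule (merge consecutive repeats, then drop blanks), plus a chunk-wise variant that carries the last emitted frame across chunk boundaries. The blank id `0`, the function names, and the toy token ids are assumptions made for this example; they are not taken from the paper or from the ictnlp/nast-s2x repository, and the paper's actual decoding procedure may differ.

```python
from typing import Iterable, List, Optional, Tuple

BLANK = 0  # assumed blank token id (hypothetical; not from the paper)


def ctc_collapse(frames: Iterable[int], blank: int = BLANK) -> List[int]:
    """Standard CTC collapse: merge consecutive repeats, then drop blanks."""
    out: List[int] = []
    prev: Optional[int] = None
    for tok in frames:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return out


def ctc_collapse_chunk(
    frames: Iterable[int], prev: Optional[int], blank: int = BLANK
) -> Tuple[List[int], Optional[int]]:
    """Collapse one chunk of frame predictions in a streaming setting.

    `prev` is the last frame of the previous chunk (None for the first
    chunk), so repeats that straddle a chunk boundary are still merged.
    Returns the newly emitted tokens and the updated `prev`.
    """
    out: List[int] = []
    for tok in frames:
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return out, prev


def stream_collapse(chunks: Iterable[Iterable[int]]) -> List[int]:
    """Feed chunks one at a time and concatenate the emitted tokens."""
    emitted: List[int] = []
    prev: Optional[int] = None
    for chunk in chunks:
        new_tokens, prev = ctc_collapse_chunk(chunk, prev)
        emitted.extend(new_tokens)
    return emitted


if __name__ == "__main__":
    # Toy frame-level predictions split into three fixed-length chunks.
    chunks = [[0, 5, 5], [5, 0, 5], [7, 7, 0]]
    print(stream_collapse(chunks))
```

Under these assumptions, collapsing chunk by chunk yields the same sequence as collapsing the concatenated frames offline, which is what makes the rule usable when output must be emitted before the full input has arrived.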

Code & Models

Repositories

ictnlp/nast-s2x (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing