StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning

Shaolei Zhang; Qingkai Fang; Shoutao Guo; Zhengrui Ma; Min Zhang; Yang Feng

arXiv:2406.03049·cs.CL·June 6, 2024

TL;DR

StreamSpeech is a unified multi-task learning model that performs simultaneous speech translation, recognition, and synthesis, achieving state-of-the-art results and providing high-quality intermediate outputs for real-time communication.

Contribution

StreamSpeech is a direct Simul-S2ST model that jointly learns translation and the simultaneous policy within a single multi-task learning framework.

Findings

01 Achieves state-of-the-art performance on the CVSS benchmark

02 Provides high-quality intermediate recognition and translation results

03 Supports offline and real-time speech translation, recognition, and synthesis

Abstract

Simultaneous speech-to-speech translation (Simul-S2ST, a.k.a. streaming speech translation) outputs target speech while receiving streaming speech inputs, which is critical for real-time communication. Beyond translating between speech, Simul-S2ST requires a policy that controls when the model generates the corresponding target speech as the speech input arrives, which poses a double challenge of translation and policy. In this paper, we propose StreamSpeech, a direct Simul-S2ST model that jointly learns translation and simultaneous policy in a unified multi-task learning framework. Through multi-task learning, StreamSpeech can perform offline and simultaneous speech recognition, speech translation, and speech synthesis via a single "All-in-One" seamless model. Experiments on the CVSS benchmark demonstrate that StreamSpeech achieves state-of-the-art…
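To make the notion of a simultaneous policy concrete, here is a minimal sketch of a fixed wait-k read/write schedule. This is a generic baseline from the simultaneous translation literature, not the policy StreamSpeech learns; the function name `wait_k_schedule` and its parameters are hypothetical and chosen only for illustration. It assumes the source arrives as a known number of speech chunks and the target as a known number of output units.

```python
from typing import List, Tuple


def wait_k_schedule(
    num_source_chunks: int, num_target_units: int, k: int
) -> List[Tuple[str, int]]:
    """Return the READ/WRITE action sequence of a fixed wait-k policy.

    Illustrative assumption only: a learned policy such as the one in the
    paper decides READ/WRITE from the input itself, not from a fixed lag k.
    """
    actions: List[Tuple[str, int]] = []
    read = 0      # source chunks consumed so far
    written = 0   # target units emitted so far
    while written < num_target_units:
        # Stay k chunks ahead of the output until the source is exhausted.
        if read < min(written + k, num_source_chunks):
            read += 1
            actions.append(("READ", read))
        else:
            written += 1
            actions.append(("WRITE", written))
    return actions


if __name__ == "__main__":
    for action, index in wait_k_schedule(num_source_chunks=3, num_target_units=3, k=2):
        print(action, index)
```

A smaller k lowers latency because output starts sooner, while a larger k gives the model more source context before each output. A fixed lag cannot adapt to the content of the speech, which is the gap a jointly learned policy is meant to close.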

Code & Models

Repositories

ictnlp/streamspeech
PyTorch · Official

Datasets

echodict/StreamSpeech
dataset · 1.1k downloads

Videos

StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Speech Recognition and Synthesis

Methods: Linear Layer · Convolution · HiFi-GAN · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Dropout · Byte Pair Encoding · Adam