StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning
Shaolei Zhang, Qingkai Fang, Shoutao Guo, Zhengrui Ma, Min Zhang, Yang, Feng

TL;DR
StreamSpeech is a unified multi-task learning model that performs simultaneous speech translation, recognition, and synthesis, achieving state-of-the-art results and providing high-quality intermediate outputs for real-time communication.
Contribution
It introduces a novel unified model that jointly learns translation and policy for simultaneous speech translation within a multi-task framework.
Findings
Achieves state-of-the-art performance on CVSS benchmark
Provides high-quality intermediate recognition and translation results
Supports offline and real-time speech translation, recognition, and synthesis
Abstract
Simultaneous speech-to-speech translation (Simul-S2ST, a.k.a streaming speech translation) outputs target speech while receiving streaming speech inputs, which is critical for real-time communication. Beyond accomplishing translation between speech, Simul-S2ST requires a policy to control the model to generate corresponding target speech at the opportune moment within speech inputs, thereby posing a double challenge of translation and policy. In this paper, we propose StreamSpeech, a direct Simul-S2ST model that jointly learns translation and simultaneous policy in a unified framework of multi-task learning. Adhering to a multi-task learning approach, StreamSpeech can perform offline and simultaneous speech recognition, speech translation and speech synthesis via an "All-in-One" seamless model. Experiments on CVSS benchmark demonstrate that StreamSpeech achieves state-of-the-art…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsNatural Language Processing Techniques · Speech and dialogue systems · Speech Recognition and Synthesis
MethodsLinear Layer · Convolution · HiFi-GAN · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Dropout · Byte Pair Encoding · Adam
