Textless Streaming Speech-to-Speech Translation using Semantic Speech Tokens

Jinzheng Zhao; Niko Moritz; Egor Lakomkin; Ruiming Xie; Zhiping Xiu; Katerina Zmolikova; Zeeshan Ahmed; Yashesh Gaur; Duc Le; Christian Fuegen

arXiv:2410.03298·eess.AS·October 7, 2024·ICASSP

Open Access

TL;DR

This paper introduces a low-latency, streaming speech translation model that directly produces speech tokens, avoiding text generation and reducing error accumulation and latency in speech-to-speech translation.

Contribution

A novel transducer-based model that outputs discrete speech tokens directly for streaming translation, outperforming existing methods in latency and translation quality.

Findings

01 Outperforms existing approaches in BLEU score

02 Achieves state-of-the-art streaming translation results in BLEU, average latency, and BLASER 2.0

03 Effective across multiple language pairs

Abstract

Cascaded speech-to-speech translation systems often suffer from error accumulation and high latency, because the inference delays of the cascaded modules add up. In this paper, we propose a transducer-based speech translation model that outputs discrete speech tokens in a low-latency streaming fashion. This approach removes the need to first generate text output and then run machine translation (MT) and text-to-speech (TTS) systems. The produced speech tokens can be used directly to generate a speech signal with low latency: an acoustic language model (LM) converts them to acoustic tokens, and an audio codec model decodes those into the waveform. Experimental results show that the proposed method outperforms other existing approaches and achieves state-of-the-art results for streaming translation in terms of BLEU, average latency, and BLASER 2.0 scores for…
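The token flow described in the abstract can be sketched roughly as below. This is a toy illustration only, not the paper's implementation: every function name (`transducer_step`, `acoustic_lm`, `codec_decode`, `stream_translate`), the energy-based emission rule, and the arithmetic inside each stage are placeholders invented here to show how the three stages could be chained chunk by chunk.

```python
from typing import Iterable, Iterator, List


def transducer_step(chunk: List[float]) -> List[int]:
    """Hypothetical stand-in for the streaming transducer.

    Maps one chunk of source audio features to zero or more semantic
    speech tokens. Returning an empty list mimics emitting only blanks.
    """
    energy = sum(abs(x) for x in chunk)
    return [] if energy < 1.0 else [int(energy) % 100]


def acoustic_lm(semantic_tokens: List[int]) -> List[int]:
    """Hypothetical stand-in for the acoustic LM (semantic -> acoustic tokens)."""
    return [t * 2 for t in semantic_tokens]


def codec_decode(acoustic_tokens: List[int]) -> List[float]:
    """Hypothetical stand-in for the codec decoder (acoustic tokens -> samples).

    Each token becomes four identical samples, purely for illustration.
    """
    return [t / 200.0 for t in acoustic_tokens for _ in range(4)]


def stream_translate(chunks: Iterable[List[float]]) -> Iterator[List[float]]:
    """Process source chunks as they arrive and yield target audio segments.

    No text is produced at any point: tokens go straight from the
    transducer to the acoustic LM and then to the codec decoder.
    """
    for chunk in chunks:
        semantic = transducer_step(chunk)
        if not semantic:
            continue  # nothing emitted for this chunk; wait for more input
        yield codec_decode(acoustic_lm(semantic))


if __name__ == "__main__":
    source = [[0.1, 0.2], [3.0, 4.5], [0.0], [10.0]]
    for segment in stream_translate(source):
        print(len(segment), "samples")
```

Because `stream_translate` is a generator, each output segment becomes available as soon as its input chunk has been processed, which is the property that keeps latency low in a streaming setup.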

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Speech Recognition and Synthesis