Direct Speech-to-speech Translation without Textual Annotation using Bottleneck Features

Junhui Zhang; Junjie Pan; Xiang Yin; Zejun Ma

arXiv:2212.05805 · cs.CL · December 13, 2022 · 1 citation

TL;DR

This paper introduces a direct speech-to-speech translation model that eliminates the need for textual annotations by using bottleneck features, achieving comparable performance to traditional cascaded systems.

Contribution

It proposes a novel end-to-end speech translation approach using bottleneck features as intermediate objectives, removing the requirement for textual annotations during training.

Findings

01. Performance matches cascaded systems in translation quality

02. Feasibility demonstrated on Mandarin-Cantonese translation

03. No need for textual annotation or phoneme prediction modules

Abstract

Speech-to-speech translation directly translates a speech utterance in one language into speech in another, and has great potential in tasks such as simultaneous interpretation. State-of-the-art models usually contain an auxiliary module for phoneme sequence prediction, which requires textual annotation of the training dataset. We propose a direct speech-to-speech translation model that can be trained without any textual annotation or content information. Instead of introducing an auxiliary phoneme prediction task, we propose to use bottleneck features as intermediate training objectives to ensure the translation performance of the system. Experiments on Mandarin-Cantonese speech translation demonstrate the feasibility of the proposed approach, and its performance can match that of a cascaded system with respect to translation and synthesis quality.
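The idea of an intermediate training objective can be illustrated with a small sketch. The abstract does not give the loss function, the weighting, or how the target bottleneck features are obtained, so everything below is an assumption made for illustration: mean squared error is used for both terms, `bnf_weight` is a hypothetical hyperparameter, and the target bottleneck features are assumed to have been extracted beforehand from the target-language speech. This is not the authors' implementation.

```python
from typing import List

Frames = List[List[float]]  # one feature vector per time frame


def mse(pred: Frames, target: Frames) -> float:
    """Mean squared error over all frames and feature dimensions."""
    if len(pred) != len(target):
        raise ValueError("frame counts differ")
    total, count = 0.0, 0
    for p_frame, t_frame in zip(pred, target):
        if len(p_frame) != len(t_frame):
            raise ValueError("feature dimensions differ")
        for p, t in zip(p_frame, t_frame):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


def combined_loss(
    pred_bnf: Frames,
    target_bnf: Frames,
    pred_out: Frames,
    target_out: Frames,
    bnf_weight: float = 1.0,  # hypothetical weight, not from the paper
) -> float:
    """Final-output loss plus a weighted intermediate bottleneck-feature loss.

    pred_bnf / target_bnf: predicted and reference bottleneck features
    (the intermediate objective). pred_out / target_out: predicted and
    reference acoustic features of the translated speech (assumed here
    to be frame-level vectors such as spectrogram frames).
    """
    return mse(pred_out, target_out) + bnf_weight * mse(pred_bnf, target_bnf)
```

In a sketch like this, the intermediate term plays the role that a phoneme prediction loss would play in a text-supervised model: it supervises an internal representation with content-bearing targets, but the targets come from speech rather than from transcripts.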

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis