The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task

Kun Song; Yi Lei; Peikun Chen; Yiqing Cao; Kun Wei; Yongmao Zhang; Lei Xie; Ning Jiang; Guoqing Zhao

arXiv:2307.04630·cs.SD·July 11, 2023

TL;DR

This paper presents a cascaded speech-to-speech translation system for IWSLT 2023 that handles multi-source input and noisy transcripts and produces natural, speaker-consistent Chinese speech from English input.

Contribution

The system introduces robust multi-source handling, a three-stage fine-tuning strategy, and a two-stage TTS framework with speaker transfer, advancing speech translation quality and robustness.

Findings

01

High translation accuracy and speech naturalness achieved.

02

Demonstrates robustness to multi-source and noisy input.

03

Effective speaker timbre transfer in translated speech.

Abstract

This paper describes the NPU-MSXF system for the IWSLT 2023 speech-to-speech translation (S2ST) task, which aims to translate multi-source English speech into Chinese speech. The system is built in a cascaded manner, consisting of automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS). We make tremendous efforts to handle the challenging multi-source input. Specifically, to improve robustness to multi-source speech input, we adopt various data augmentation strategies and a ROVER-based score fusion on multiple ASR model outputs. To better handle the noisy ASR transcripts, we introduce a three-stage fine-tuning strategy to improve translation accuracy. Finally, we build a TTS model with high naturalness and sound quality, which leverages a two-stage framework, using network bottleneck features as a robust intermediate representation for speaker…
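
To make the idea of fusing multiple ASR outputs concrete, here is a minimal sketch of word-level voting in the spirit of ROVER. It is not the authors' implementation: real ROVER builds a word transition network through iterative dynamic-programming alignment and can use confidence scores, whereas this sketch simply aligns every hypothesis to the first one with Python's `difflib` and drops inserted words. The function name `fuse_hypotheses` and the `weights` parameter are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def fuse_hypotheses(hyps, weights=None):
    """Simplified ROVER-style voting over ASR hypotheses (illustrative only).

    Every hypothesis is aligned to the first one; each aligned word casts a
    weighted vote for its slot. Assumption: words inserted relative to the
    first hypothesis are ignored, which full ROVER does not do.
    """
    if weights is None:
        weights = [1.0] * len(hyps)
    base = hyps[0].split()
    # votes[i] maps a candidate word ("" means deletion) to its total weight
    votes = [defaultdict(float) for _ in base]
    for hyp, weight in zip(hyps, weights):
        words = hyp.split()
        matcher = SequenceMatcher(a=base, b=words, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "insert":
                continue  # simplification: extra words are dropped
            for k in range(i1, i2):
                j = j1 + (k - i1)
                # "delete", or a "replace" with fewer words, votes for deletion
                candidate = words[j] if tag != "delete" and j < j2 else ""
                votes[k][candidate] += weight
    # ties resolve in favour of the first hypothesis, which votes first
    fused = [max(slot, key=slot.get) for slot in votes]
    return " ".join(word for word in fused if word)


hypotheses = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "the hat sat on the mat",
]
print(fuse_hypotheses(hypotheses))
```

Each of the two erroneous hypotheses is outvoted in the slot where it disagrees, so the fused output recovers the transcript that the majority supports. Passing per-model `weights` would let a more reliable recognizer count for more in each vote.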

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing