The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task
Kun Song, Yi Lei, Peikun Chen, Yiqing Cao, Kun Wei, Yongmao Zhang, Lei Xie, Ning Jiang, Guoqing Zhao

TL;DR
This paper presents a cascaded speech-to-speech translation system for IWSLT 2023 that handles multi-source input and noisy ASR transcripts, and produces natural, speaker-consistent Chinese speech from English input.
Contribution
The system combines data augmentation and ROVER-based fusion of multiple ASR outputs for robustness to multi-source input, a three-stage fine-tuning strategy for translating noisy ASR transcripts, and a two-stage TTS framework with speaker timbre transfer.
Findings
High translation accuracy and speech naturalness achieved.
Demonstrates robustness to multi-source and noisy input.
Effective speaker timbre transfer in translated speech.
Abstract
This paper describes the NPU-MSXF system for the IWSLT 2023 speech-to-speech translation (S2ST) task, which aims to translate multi-source English speech into Chinese speech. The system is built as a cascade of automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS). We devote particular effort to handling the challenging multi-source input. Specifically, to improve robustness to multi-source speech input, we adopt various data augmentation strategies and ROVER-based score fusion over multiple ASR model outputs. To better handle noisy ASR transcripts, we introduce a three-stage fine-tuning strategy to improve translation accuracy. Finally, we build a TTS model with high naturalness and sound quality, which leverages a two-stage framework, using network bottleneck features as a robust intermediate representation for speaker…
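As a rough illustration of the idea behind fusing multiple ASR outputs, the sketch below aligns several hypotheses to a reference and takes a per-word majority vote. It is a deliberately simplified stand-in, not the system's method: real ROVER builds a word transition network by dynamic-programming alignment and can weight votes by confidence scores, whereas this version uses Python's `difflib.SequenceMatcher` for alignment, treats the first hypothesis as the reference, ignores scores, and drops words that other hypotheses insert. The function names `align_to_reference` and `vote` are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def align_to_reference(reference, hypothesis):
    """For each reference position, return the aligned hypothesis word or None."""
    aligned = [None] * len(reference)
    matcher = SequenceMatcher(a=reference, b=hypothesis, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("equal", "replace"):
            # Pair words position by position within the matched span.
            for offset in range(min(i2 - i1, j2 - j1)):
                aligned[i1 + offset] = hypothesis[j1 + offset]
        # "delete": reference words without a counterpart stay None.
        # "insert": hypothesis-only words are dropped (simplification).
    return aligned


def vote(hypotheses):
    """Majority vote per reference slot; the first hypothesis is the reference."""
    reference = hypotheses[0].split()
    columns = [[word] for word in reference]
    for hyp in hypotheses[1:]:
        for slot, word in zip(columns, align_to_reference(reference, hyp.split())):
            slot.append(word)  # None counts as a vote to delete this slot
    fused = []
    for slot in columns:
        winner, _ = Counter(slot).most_common(1)[0]
        if winner is not None:
            fused.append(winner)
    return " ".join(fused)


if __name__ == "__main__":
    outputs = [
        "the cat sat on the mat",
        "the cat sad on the mat",
        "the cat sat on a mat",
    ]
    print(vote(outputs))
```

Here each recognizer makes a different single-word error, and the vote recovers the word that two of the three systems agree on. A score-based fusion would replace the raw counts with per-word confidences from each ASR model.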
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
