Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders
Chen Xu, Bojie Hu, Yanyang Li, Yuhao Zhang, Shen Huang, Qi Ju, Tong Xiao, Jingbo Zhu

TL;DR
This paper introduces a stacked acoustic-and-textual encoding (SATE) approach for speech translation that integrates pre-trained models into the encoder, reaching state-of-the-art results and matching or surpassing cascaded systems.
Contribution
It proposes SATE, an encoder that stacks acoustic encoding and textual encoding, together with an adaptor module and knowledge distillation, to improve end-to-end speech translation performance.
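Knowledge distillation is named here without further detail. As a rough illustration only, the sketch below shows one common generic formulation: the student's hard-label cross-entropy is interpolated with its cross-entropy against a teacher's output distribution. The function names, the `alpha` weight, and the choice of this particular formulation are assumptions for illustration and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_probs, target_index, alpha=0.5):
    """Generic per-token distillation loss (assumed form, not the paper's).

    hard: cross-entropy of the student against the reference token.
    soft: cross-entropy of the student against the teacher distribution.
    alpha: hypothetical interpolation weight between the two terms.
    """
    student_probs = softmax(student_logits)
    hard = -math.log(student_probs[target_index])
    soft = -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))
    return (1.0 - alpha) * hard + alpha * soft


# A student that agrees with the teacher gets a lower loss than a uniform one.
teacher = [0.8, 0.1, 0.1]
uniform_student = distillation_loss([0.0, 0.0, 0.0], teacher, target_index=0)
agreeing_student = distillation_loss([2.0, 0.0, 0.0], teacher, target_index=0)
```

In a speech translation setting the teacher would plausibly be a pre-trained text-based model and the student the speech translation model, but which models play which role is not stated in this summary.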
Findings
Achieves state-of-the-art BLEU scores of 18.3 on LibriSpeech En-Fr and 25.2 on MuST-C En-De.
First end-to-end ST system with comparable or better performance than cascaded systems.
Effectively incorporates pre-trained models into speech translation encoders.
Abstract
Encoder pre-training is promising in end-to-end Speech Translation (ST), given the fact that speech-to-translation data is scarce. But ST encoders are not simple instances of Automatic Speech Recognition (ASR) or Machine Translation (MT) encoders. For example, we find that ASR encoders lack the global context representation, which is necessary for translation, whereas MT encoders are not designed to deal with long but locally attentive acoustic sequences. In this work, we propose a Stacked Acoustic-and-Textual Encoding (SATE) method for speech translation. Our encoder begins with processing the acoustic sequence as usual, but later behaves more like an MT encoder for a global representation of the input sequence. In this way, it is straightforward to incorporate the pre-trained models into the system. Also, we develop an adaptor module to alleviate the representation inconsistency…
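The stacking order described in the abstract (a locally attentive acoustic stage, then an adaptor, then a globally attentive textual stage) can be pictured with a toy sketch. Everything in it is an assumption made for illustration: the windowed averaging stands in for an acoustic encoder, the unparameterised self-attention stands in for an MT-style encoder, and `identity_adaptor` is a placeholder because the adaptor's internals are not described in the text above. It is not the authors' implementation.

```python
import math


def mean_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def acoustic_encode(frames, window=1):
    """Local stage (toy): each position only sees frames within `window`."""
    out = []
    for i in range(len(frames)):
        lo, hi = max(0, i - window), min(len(frames), i + window + 1)
        out.append(mean_vectors(frames[lo:hi]))
    return out


def identity_adaptor(states):
    """Placeholder adaptor: copies its input unchanged (hypothetical)."""
    return [list(s) for s in states]


def textual_encode(states):
    """Global stage (toy): scaled dot-product self-attention, no learned weights."""
    dim = len(states[0])
    out = []
    for q in states:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in states]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, states)) / total
                    for d in range(dim)])
    return out


def stacked_encode(frames, adaptor=identity_adaptor, window=1):
    """Acoustic stage first, then the adaptor, then the textual stage."""
    return textual_encode(adaptor(acoustic_encode(frames, window)))


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
encoded = stacked_encode(frames)
```

In this toy, changing the last frame leaves the first output of `acoustic_encode` untouched but does change the first output of `stacked_encode`, which mirrors the abstract's contrast between local acoustic context and the global context needed for translation.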
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling
Methods: Knowledge Distillation
