Unveiling the Role of Pretraining in Direct Speech Translation
Belen Alastruey, Gerard I. Gállego, Marta R. Costa-jussà

TL;DR
This paper investigates the impact of pretraining on direct speech translation, revealing training challenges and proposing a decoder modification to enable training from scratch with comparable performance and reduced training time.
Contribution
The study identifies training difficulties in direct speech translation and introduces a decoder change that allows training from scratch effectively, reducing reliance on pretraining.
Findings
Pretrained encoders facilitate learning in speech translation.
A decoder modification enables training from scratch effectively.
With the decoder modification, training from scratch can reach performance comparable to the pretrained model in less training time.
Abstract
Direct speech-to-text translation systems encounter an important drawback in data scarcity. A common solution consists of pretraining the encoder on automatic speech recognition, hence losing efficiency in the training process. In this study, we compare the training dynamics of a system using a pretrained encoder, the conventional approach, and one trained from scratch. We observe that, throughout the training, the randomly initialized model struggles to incorporate information from the speech inputs for its predictions. Hence, we hypothesize that this issue stems from the difficulty of effectively training an encoder for direct speech translation. While a model trained from scratch needs to learn acoustic and semantic modeling simultaneously, a pretrained one can just focus on the latter. Based on these findings, we propose a subtle change in the decoder cross-attention to integrate…
Taxonomy
Topics: Subtitles and Audiovisual Media
Methods: Focus
