Efficient Speech Translation with Pre-trained Models
Zhaolin Li, Jan Niehues

TL;DR
This paper explores efficient ways to build speech translation systems from pre-trained models. It shows that cascaded and end-to-end systems can be trained on a single GPU, and that an additional similarity loss improves translation quality and data efficiency when end-to-end training data is limited.
Contribution
It introduces strategies for building speech translation systems with pre-trained models on a single GPU and proposes a similarity loss to enhance data efficiency and translation quality.
Findings
End-to-end models outperform cascaded models in translation quality.
The similarity loss increases BLEU scores by 6 points with limited data.
Single GPU training is feasible for high-performance speech translation models.
Abstract
When building state-of-the-art speech translation models, the need for large computational resources is a significant obstacle due to the large training data size and complex models. The availability of pre-trained models is a promising opportunity to build strong speech translation systems efficiently. In a first step, we investigate efficient strategies to build cascaded and end-to-end speech translation systems based on pre-trained models. Using this strategy, we can train and apply the models on a single GPU. While the end-to-end models show superior translation performance to cascaded ones, the application of this technology is limited by the need for additional end-to-end training data. In a second step, we propose an additional similarity loss to encourage the model to generate similar hidden representations for speech and transcript. Using this technique, we can increase…
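The similarity loss described in the abstract can be illustrated with a small sketch. The summary does not specify how the loss is computed, so everything here is an assumption made for illustration: the mean pooling over time, the squared-distance measure, the `weight` parameter, and the function names `mean_pool`, `similarity_loss`, and `total_loss` are all hypothetical and may differ from the authors' formulation.

```python
# Illustrative sketch only: pooling, distance measure, and weighting are
# assumptions, not the paper's confirmed method.

def mean_pool(states):
    """Average a sequence of hidden vectors (one per frame or token) over time."""
    dim = len(states[0])
    return [sum(vec[i] for vec in states) / len(states) for i in range(dim)]


def similarity_loss(speech_states, text_states):
    """Mean squared difference between pooled speech and transcript vectors.

    Pooling first lets the two sequences have different lengths, which is
    typical because speech has far more frames than the transcript has tokens.
    """
    speech_vec = mean_pool(speech_states)
    text_vec = mean_pool(text_states)
    return sum((s - t) ** 2 for s, t in zip(speech_vec, text_vec)) / len(speech_vec)


def total_loss(translation_loss, speech_states, text_states, weight=1.0):
    """Add the weighted similarity term to an already computed translation loss."""
    return translation_loss + weight * similarity_loss(speech_states, text_states)


# Toy hidden states: three speech frames and one transcript token, dimension 2.
speech = [[1.0, 0.0], [3.0, 2.0], [2.0, 1.0]]
transcript = [[0.0, 1.0]]

print(similarity_loss(speech, transcript))                 # 2.0
print(total_loss(1.5, speech, transcript, weight=0.5))     # 2.5
```

When the pooled speech and transcript vectors coincide, the extra term is zero, so it only penalizes the model when the two modalities are encoded differently.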
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
