The IWSLT 2021 BUT Speech Translation Systems
Hari Krishna Vydana, Martin Karafiát, Lukáš Burget, Jan "Honza" Černocký

TL;DR
This paper presents BUT's joint speech recognition and translation systems for English-German translation, emphasizing the benefits of large-scale pre-training and integrated models for improved translation quality.
Contribution
It introduces a joint ASR-MT training approach that feeds the ASR decoder's internal continuous representations to the MT module and draws on large text-only data, improving speech translation performance.
Findings
Joint training improves translation accuracy.
Using punctuated ASR outputs enhances translation quality.
Pre-training on large datasets benefits end-to-end speech translation.
Abstract
The paper describes BUT's English-to-German offline speech translation (ST) systems developed for IWSLT 2021. They are based on jointly trained Automatic Speech Recognition-Machine Translation (ASR-MT) models. Their performance is evaluated on the MuST-C Common test set. In this work, we study their efficiency from the perspective of having a large amount of separate ASR training data and MT training data, and a smaller amount of speech-translation training data. Large amounts of ASR and MT training data are used to pre-train the ASR and MT models. Speech-translation data is used to jointly optimize the ASR-MT models by defining an end-to-end differentiable path from speech to translations. For this purpose, we use the internal continuous representations from the ASR decoder as the input to the MT module. We show that speech translation can be further improved by training the ASR-decoder jointly with…
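To illustrate why the abstract stresses continuous representations, here is a toy sketch (not the BUT system) contrasting two ways of connecting an ASR stage to an MT stage. All names (`asr_hidden`, `mt_loss_continuous`, `mt_loss_discrete`, the scalar weights `w` and `v`) are hypothetical, the "models" are single scalars, and finite differences stand in for backpropagation. The point is only that a hard token decision between the stages blocks the training signal from the translation loss, while passing continuous states keeps the path differentiable.

```python
import math

def asr_hidden(w, x):
    # Toy "ASR decoder": one continuous hidden state per input frame.
    return [math.tanh(w * xi) for xi in x]

def mt_loss_continuous(w, v, x, target):
    # Joint model: the toy MT module consumes the continuous states directly.
    h = asr_hidden(w, x)
    pred = sum(v * hi for hi in h)
    return (pred - target) ** 2

def mt_loss_discrete(w, v, x, target, embed=(-1.0, 1.0)):
    # Cascade: states are collapsed to discrete tokens (hard decision),
    # then re-embedded before the toy MT module sees them.
    h = asr_hidden(w, x)
    tokens = [1 if hi > 0 else 0 for hi in h]
    pred = sum(v * embed[t] for t in tokens)
    return (pred - target) ** 2

def finite_diff(f, w, eps=1e-4):
    # Central finite difference as a stand-in for a backpropagated gradient.
    return (f(w + eps) - f(w - eps)) / (2 * eps)

# Hypothetical toy inputs and parameters.
x = [0.5, -1.0, 2.0]
v, target, w = 0.7, 1.0, 0.3

grad_cont = finite_diff(lambda w_: mt_loss_continuous(w_, v, x, target), w)
grad_disc = finite_diff(lambda w_: mt_loss_discrete(w_, v, x, target), w)

print("gradient w.r.t. ASR weight, continuous path:", grad_cont)
print("gradient w.r.t. ASR weight, discrete path:  ", grad_disc)
```

In this toy setup the translation loss yields a non-zero gradient for the ASR weight only on the continuous path; on the discrete path a small change in `w` does not change the chosen tokens, so the loss is locally flat.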
