Speechformer: Reducing Information Loss in Direct Speech Translation
Sara Papi, Marco Gaido, Matteo Negri, Marco Turchi

TL;DR
Speechformer is a novel speech translation architecture that preserves more linguistic information by avoiding early lossy compression, leading to improved translation quality especially in low-resource scenarios.
Contribution
The paper introduces Speechformer, an architecture whose attention layers use less memory. This lets the model skip the initial lossy compression of the audio input and aggregate linguistic information only at a higher level.
Findings
Up to 0.8 BLEU improvement on the MuST-C corpus
Up to 4.0 BLEU gain in low-resource scenarios
Effective preservation of linguistic information in speech translation
Abstract
Transformer-based models have gained increasing popularity, achieving state-of-the-art performance in many research fields, including speech translation. However, the Transformer's quadratic complexity with respect to the input sequence length prevents its adoption as-is with audio signals, which are typically represented by long sequences. Current solutions resort to an initial sub-optimal compression based on a fixed sampling of raw audio features. Therefore, potentially useful linguistic information is not accessible to higher-level layers in the architecture. To solve this issue, we propose Speechformer, an architecture that, thanks to reduced memory usage in the attention layers, avoids the initial lossy compression and aggregates information only at a higher level according to more informed linguistic criteria. Experiments on three language pairs (en->de/es/nl) show the efficacy of our…
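To make the quadratic-complexity argument concrete, the sketch below counts the entries in a single self-attention score matrix for an utterance before and after a fixed sub-sampling step. The 10 ms frame shift, the 20-second utterance, and the 4x compression factor are illustrative assumptions, not values taken from the abstract, and the helper names are hypothetical.

```python
def frames_for(seconds: float, frame_shift_ms: float = 10.0) -> int:
    """Number of feature frames for an utterance (assumed 10 ms frame shift)."""
    return int(seconds * 1000 / frame_shift_ms)


def attention_scores(seq_len: int) -> int:
    """Entries in one self-attention score matrix of shape (seq_len, seq_len)."""
    return seq_len * seq_len


# Assumed example: a 20-second utterance.
n_raw = frames_for(20.0)

# Assumed fixed 4x sub-sampling applied before the attention layers.
n_compressed = n_raw // 4

# Shrinking the sequence by a factor k shrinks the score matrix by k squared.
ratio = attention_scores(n_raw) / attention_scores(n_compressed)

print(n_raw, n_compressed, ratio)
```

Under these assumptions, a 4x shorter sequence needs a 16x smaller score matrix, which is why fixed sub-sampling is attractive despite discarding information, and why an attention layer with lower memory usage can afford to work on the longer, uncompressed sequence.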
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
