Speechformer: Reducing Information Loss in Direct Speech Translation

Sara Papi; Marco Gaido; Matteo Negri; Marco Turchi

arXiv:2109.04574 · cs.CL · October 19, 2023 · 1 citation

TL;DR

Speechformer is a novel speech translation architecture that preserves more linguistic information by avoiding early lossy compression, leading to improved translation quality especially in low-resource scenarios.

Contribution

The paper introduces Speechformer, an architecture whose attention layers use less memory, so it can skip the initial lossy compression of the audio input and aggregate information only at a higher level, according to more informed linguistic criteria.

Findings

01. Up to 0.8 BLEU improvement on the MuST-C corpus

02. Up to 4.0 BLEU gain in low-resource scenarios

03. Effective preservation of linguistic information in speech translation

Abstract

Transformer-based models have gained increasing popularity achieving state-of-the-art performance in many research fields including speech translation. However, Transformer's quadratic complexity with respect to the input sequence length prevents its adoption as is with audio signals, which are typically represented by long sequences. Current solutions resort to an initial sub-optimal compression based on a fixed sampling of raw audio features. Therefore, potentially useful linguistic information is not accessible to higher-level layers in the architecture. To solve this issue, we propose Speechformer, an architecture that, thanks to reduced memory usage in the attention layers, avoids the initial lossy compression and aggregates information only at a higher level according to more informed linguistic criteria. Experiments on three language pairs (en->de/es/nl) show the efficacy of our…
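To make the quadratic cost mentioned in the abstract concrete, here is a minimal back-of-the-envelope sketch in Python. It only counts entries in the attention score matrix. The 10 ms frame shift, the 4x fixed subsampling, the key/value stride of 16, and both function names are illustrative assumptions, not values or identifiers taken from the paper. Shortening keys and values is shown as one generic way to cut attention memory; the abstract does not say which mechanism Speechformer actually uses.

```python
def attention_score_elements(seq_len: int, num_heads: int = 1) -> int:
    """Entries in the self-attention score matrices of one layer (n x n per head)."""
    return num_heads * seq_len * seq_len


def compressed_kv_score_elements(seq_len: int, kv_stride: int, num_heads: int = 1) -> int:
    """Entries when only keys/values are shortened by a fixed stride.

    Hypothetical helper: a generic memory-reduction scheme, not necessarily
    the one used by Speechformer.
    """
    kv_len = -(-seq_len // kv_stride)  # ceiling division
    return num_heads * seq_len * kv_len


frames = 1000  # about 10 s of audio, assuming a 10 ms frame shift

# Full-resolution attention: cost grows with the square of the length.
print(attention_score_elements(frames))            # 1000000

# Fixed 4x subsampling before the encoder (the "initial compression").
print(attention_score_elements(frames // 4))       # 62500

# Full-length queries, keys/values shortened by an assumed stride of 16.
print(compressed_kv_score_elements(frames, 16))    # 63000
```

Under these assumed numbers, shortening only the keys and values brings the score matrix to roughly the size obtained with 4x input subsampling, while every input frame still gets its own query position.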

Code & Models

Repositories

sarapapi/fbk-fairseq
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing