State Spaces Aren't Enough: Machine Translation Needs Attention
Ali Vardasbi, Telmo Pessoa Pires, Robin M. Schmidt, Stephan Peitz

TL;DR
This paper evaluates Structured State Spaces for Sequences (S4) on machine translation, shows where it falls short of the Transformer, and demonstrates that adding attention improves performance.
Contribution
The paper shows that S4, despite its success in other domains, underperforms in machine translation because it cannot summarize the full source sentence in a single hidden state, and it proposes adding an attention mechanism to address this.
Findings
S4 lags behind the Transformer by approximately 4 BLEU points in MT.
S4 struggles with long sentences in translation.
Adding attention to S4 closes the performance gap.
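The last finding can be illustrated with a small sketch of what an attention step provides: the decoder reads a weighted mix of every source position instead of one fixed summary. This is a minimal single-head, scaled dot-product version in plain Python, not the paper's implementation. The function name `attend`, the toy `encoder_states`, and the choice to reuse the states as both keys and values are assumptions made for brevity.

```python
import math
from typing import List, Tuple


def attend(query: List[float], states: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Scaled dot-product attention of one query over all source states.

    The states act as both keys and values (a simplifying assumption).
    Returns the context vector and the attention weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * s for q, s in zip(query, state)) / scale for state in states]
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The context is a weighted average of the states, one value per dimension.
    context = [
        sum(w * state[j] for w, state in zip(weights, states))
        for j in range(len(query))
    ]
    return context, weights


# Hypothetical per-position encoder outputs for a three-token source sentence.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = attend([1.0, 0.0], encoder_states)
print(weights)
```

Because there is one weight per source position, the amount of source information available to the decoder grows with sentence length rather than being fixed in advance.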
Abstract
Structured State Spaces for Sequences (S4) is a recently proposed sequence model with successful applications in various tasks, e.g. vision, language modeling, and audio. Thanks to its mathematical formulation, it compresses its input to a single hidden state, and is able to capture long range dependencies while avoiding the need for an attention mechanism. In this work, we apply S4 to Machine Translation (MT), and evaluate several encoder-decoder variants on WMT'14 and WMT'16. In contrast with the success in language modeling, we find that S4 lags behind the Transformer by approximately 4 BLEU points, and that it counter-intuitively struggles with long sentences. Finally, we show that this gap is caused by S4's inability to summarize the full source sentence in a single hidden state, and show that we can close the gap by introducing an attention mechanism.
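The "single hidden state" bottleneck described in the abstract can be sketched with a toy linear state space recurrence, x_k = a * x_{k-1} + b * u_k. This is a heavily simplified illustration, assuming a diagonal state matrix and scalar inputs; it omits S4's structured parameterization and discretization, and the names `ssm_scan`, `a`, and `b` and all numeric values are hypothetical.

```python
from typing import List


def ssm_scan(a: List[float], b: List[float], inputs: List[float]) -> List[List[float]]:
    """Run the elementwise recurrence x_k = a * x_{k-1} + b * u_k.

    `a` and `b` hold one coefficient per state dimension (diagonal model).
    Returns the hidden state after every input step.
    """
    x = [0.0] * len(a)
    states = []
    for u_k in inputs:
        x = [a_i * x_i + b_i * u_k for a_i, x_i, b_i in zip(a, x, b)]
        states.append(x)
    return states


a = [0.9, 0.5]  # hypothetical per-dimension decay rates
b = [1.0, 1.0]  # hypothetical input weights

short_states = ssm_scan(a, b, [1.0, 2.0, 3.0])
long_states = ssm_scan(a, b, [1.0] * 50)

# The final state has the same size for a 3-step and a 50-step input.
print(len(short_states[-1]), len(long_states[-1]))
```

If only the final state is handed to a decoder, a 50-token source must fit in the same number of values as a 3-token one, which is the kind of compression the abstract identifies as the cause of the gap on long sentences.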
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Dense Connections · Label Smoothing · Dropout · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection
