On the locality bias and results in the Long Range Arena
Pablo Miralles-González, Javier Huertas-Tato, Alejandro Martín, David Camacho

TL;DR
This paper analyzes the Long Range Arena benchmark, revealing that short-range dependencies largely drive performance and that architectural biases influence results, suggesting the need for a redesigned benchmark.
Contribution
It explains why architectures like SSMs outperform Transformers in LRA and demonstrates how training techniques can improve Transformer performance, challenging the benchmark's validity.
Findings
Most LRA performance stems from short-range dependencies.
Proper positional encoding enables Transformers to achieve state-of-the-art results.
Removing restrictions on SSM kernels does not decrease performance.
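The first finding can be made concrete with a small sketch. One hypothetical way to probe how much a task depends on distant tokens is to restrict each position to a fixed local window and compare results against the unrestricted model. The function `local_attention_weights` and its `window` parameter are illustrative assumptions, not the authors' code or experimental setup.

```python
import math


def local_attention_weights(scores, window):
    """Row-wise softmax over `scores`, keeping only keys with |i - j| <= window.

    `scores` is a square list of lists of raw attention scores.
    Keys outside the window get weight 0.0, as if masked to -inf.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Indices of the keys that query position i is allowed to see.
        allowed = [j for j in range(n) if abs(i - j) <= window]
        # Subtract the row maximum for numerical stability.
        row_max = max(scores[i][j] for j in allowed)
        exps = {j: math.exp(scores[i][j] - row_max) for j in allowed}
        total = sum(exps.values())
        weights.append([exps[j] / total if j in exps else 0.0 for j in range(n)])
    return weights


# Toy input: uniform scores over 6 positions, each query sees 1 neighbor per side.
scores = [[0.0] * 6 for _ in range(6)]
weights = local_attention_weights(scores, window=1)
```

If a model restricted this way loses little accuracy as `window` shrinks, that suggests the task is mostly driven by short-range dependencies.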
Abstract
The Long Range Arena (LRA) benchmark was designed to evaluate the performance of Transformer improvements and alternatives in long-range dependency modeling tasks. The Transformer and its main variants performed poorly on this benchmark, and a new series of architectures such as State Space Models (SSMs) gained some traction, greatly outperforming Transformers in the LRA. Recent work has shown that with a denoising pre-training phase, Transformers can achieve results in the LRA that are competitive with these new architectures. In this work, we discuss and explain the superiority of architectures such as MEGA and SSMs in the Long Range Arena, as well as the recent improvement in the results of Transformers, pointing to the positional and local nature of the tasks. We show that while the LRA is a benchmark for long-range dependency modeling, in reality most of the performance comes from…
Taxonomy
Topics: Seismic Imaging and Inversion Techniques
Methods: Softmax · Residual Connection · Dropout · Absolute Position Encodings · Byte Pair Encoding · Linear Layer · Attention Is All You Need · Multi-Head Attention · Position-Wise Feed-Forward Layer · Label Smoothing
