On the locality bias and results in the Long Range Arena

Pablo Miralles-González; Javier Huertas-Tato; Alejandro Martín; David Camacho

arXiv:2501.14850 · cs.CL · January 28, 2025

Open Access

TL;DR

This paper analyzes the Long Range Arena benchmark, revealing that short-range dependencies largely drive performance and that architectural biases influence results, suggesting the need for a redesigned benchmark.

Contribution

It explains why architectures like SSMs outperform Transformers in LRA and demonstrates how training techniques can improve Transformer performance, challenging the benchmark's validity.

Findings

1. Most LRA performance stems from short-range dependencies.
2. Proper positional encoding enables Transformers to achieve state-of-the-art results.
3. Removing restrictions on SSM kernels does not decrease performance.
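
One way to picture the locality claim in the findings above is to restrict each position's attention to a small neighborhood; if a model keeps most of its accuracy under such a restriction, the task is mostly short-range. The following is a minimal sketch of that idea in plain Python, not the authors' code: the function names `local_mask` and `masked_attention_weights`, the window size, and the uniform scores are all illustrative assumptions.

```python
import math


def local_mask(seq_len, window):
    """mask[i][j] is True when position i may attend to position j (|i - j| <= window)."""
    return [[abs(i - j) <= window for j in range(seq_len)] for i in range(seq_len)]


def masked_attention_weights(scores, mask):
    """Row-wise softmax over scores; positions outside the mask get zero weight."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        allowed = [s for s, m in zip(row_scores, row_mask) if m]
        peak = max(allowed)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Hypothetical toy setup: 6 positions, each seeing only its immediate neighbors.
seq_len, window = 6, 1
mask = local_mask(seq_len, window)
scores = [[0.0] * seq_len for _ in range(seq_len)]  # uniform scores, for illustration only
weights = masked_attention_weights(scores, mask)
```

Here each row of `weights` sums to one and is nonzero only inside the window, so no information can flow directly between distant positions within a single layer. Varying `window` in a setup like this is one hedged way to estimate how much of a task depends on long-range interactions.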

Abstract

The Long Range Arena (LRA) benchmark was designed to evaluate the performance of Transformer improvements and alternatives in long-range dependency modeling tasks. The Transformer and its main variants performed poorly on this benchmark, and a new series of architectures such as State Space Models (SSMs) gained some traction, greatly outperforming Transformers in the LRA. Recent work has shown that, with a denoising pre-training phase, Transformers can achieve results in the LRA that are competitive with these new architectures. In this work, we discuss and explain the superiority of architectures such as MEGA and SSMs in the Long Range Arena, as well as the recent improvement in the results of Transformers, pointing to the positional and local nature of the tasks. We show that while the LRA is a benchmark for long-range dependency modeling, in reality most of the performance comes from…

Taxonomy

Topics: Seismic Imaging and Inversion Techniques

Methods: Softmax · Residual Connection · Dropout · Absolute Position Encodings · Byte Pair Encoding · Linear Layer · Attention Is All You Need · Multi-Head Attention · Position-Wise Feed-Forward Layer · Label Smoothing