Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors
Ido Amos, Jonathan Berant, Ankit Gupta

TL;DR
This paper demonstrates that pretraining with data-driven priors significantly narrows performance gaps between long-sequence models, challenging prior conclusions based on random initialization and highlighting the importance of pretraining for fair comparisons.
Contribution
It shows that pretraining with data-driven priors, using only the downstream task data, improves long-sequence models and leaves only very small performance gaps between Transformers and state space models.
Findings
Pretraining reduces performance gaps between architectures.
Vanilla Transformers match S4 on Long Range Arena when pretrained.
Pretraining improves the best reported SSM results on PathX-256 by 20 absolute points.
Abstract
Modeling long-range dependencies across sequences is a longstanding goal in machine learning and has led to architectures, such as state space models, that dramatically outperform Transformers on long sequences. However, these impressive empirical gains have been by and large demonstrated on benchmarks (e.g., Long Range Arena), where models are randomly initialized and trained to predict a target label from an input sequence. In this work, we show that random initialization leads to gross overestimation of the differences between architectures and that pretraining with standard denoising objectives, using only the downstream task data, leads to dramatic gains across multiple architectures and to very small gaps between Transformers and state space models (SSMs). In stark contrast to prior works, we find vanilla Transformers to match the performance of S4 on Long Range Arena…
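To make "pretraining with standard denoising objectives" on downstream data concrete, here is a minimal sketch of one common denoising setup, masked token prediction. It is an illustration under stated assumptions, not the authors' implementation: `MASK_ID`, `IGNORE`, `mask_for_denoising`, `build_pretraining_set`, and the 15% masking rate are all hypothetical choices, and the paper's actual objective and hyperparameters may differ.

```python
import random

MASK_ID = 0      # hypothetical id reserved for the mask token (real tokens are >= 1)
IGNORE = -100    # hypothetical marker for positions that do not contribute to the loss


def mask_for_denoising(tokens, mask_prob=0.15, rng=None):
    """Corrupt one token sequence for a masked denoising objective.

    Returns (corrupted, targets). At each masked position, corrupted holds
    MASK_ID and targets holds the original token; everywhere else, corrupted
    holds the original token and targets holds IGNORE.
    """
    rng = rng or random.Random()
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(MASK_ID)
            targets.append(tok)
        else:
            corrupted.append(tok)
            targets.append(IGNORE)
    return corrupted, targets


def build_pretraining_set(downstream_inputs, mask_prob=0.15, seed=0):
    """Build denoising examples from the downstream inputs only.

    Task labels are never used here: the model would first be trained to
    reconstruct masked tokens, then fine-tuned on the labeled task.
    """
    rng = random.Random(seed)
    return [mask_for_denoising(seq, mask_prob, rng) for seq in downstream_inputs]
```

The point of the sketch is that no external corpus is involved: the pretraining examples are derived from the same input sequences the model will later be fine-tuned on, so any gain comes from the initialization rather than from extra data.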
Taxonomy
Topics: Machine Learning and Data Classification · Advanced Neural Network Applications · Explainable Artificial Intelligence (XAI)
