Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors

Ido Amos; Jonathan Berant; Ankit Gupta

arXiv:2310.02980 · cs.LG · April 30, 2024 · 5 cites

TL;DR

This paper demonstrates that pretraining with data-driven priors significantly narrows performance gaps between long-sequence models, challenging prior conclusions based on random initialization and highlighting the importance of pretraining for fair comparisons.

Contribution

The paper shows that pretraining with data-driven priors, using only the downstream task data, improves long-sequence models across architectures and leaves only very small performance gaps between Transformers and state space models.

Findings

01. Pretraining reduces performance gaps between architectures.

02. Vanilla Transformers match S4 on Long Range Arena when pretrained.

03. Pretraining improves SSMs' results on PathX-256 by 20 points.

Abstract

Modeling long-range dependencies across sequences is a longstanding goal in machine learning and has led to architectures, such as state space models, that dramatically outperform Transformers on long sequences. However, these impressive empirical gains have been by and large demonstrated on benchmarks (e.g., Long Range Arena), where models are randomly initialized and trained to predict a target label from an input sequence. In this work, we show that random initialization leads to gross overestimation of the differences between architectures and that pretraining with standard denoising objectives, using only the downstream task data, leads to dramatic gains across multiple architectures and to very small gaps between Transformers and state space models (SSMs). In stark contrast to prior works, we find vanilla Transformers to match the performance of S4 on Long Range Arena…
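To make "pretraining with standard denoising objectives, using only the downstream task data" concrete, here is a minimal sketch of how masked-denoising examples could be built from a labeled downstream dataset. It is an illustration, not the authors' implementation: the function names, `MASK_ID`, `IGNORE`, and the 15% default masking rate are all hypothetical choices, and token id 0 is assumed to be reserved for the mask. The point it shows is that the task labels are discarded and only the input sequences are reused as the pretraining signal.

```python
import random

MASK_ID = 0    # assumption: token id 0 is reserved for the mask symbol
IGNORE = -100  # assumption: marker for positions that contribute no loss


def make_denoising_example(tokens, mask_prob=0.15, rng=None):
    """Corrupt one sequence for masked denoising.

    Returns (inputs, targets): masked positions hold MASK_ID in `inputs`
    and the original token in `targets`; all other positions keep the
    original token in `inputs` and IGNORE in `targets`.
    """
    if rng is None:
        rng = random.Random()
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)   # hide the token from the model
            targets.append(tok)      # ask the model to reconstruct it
        else:
            inputs.append(tok)
            targets.append(IGNORE)   # no loss on uncorrupted positions
    return inputs, targets


def pretraining_set(labeled_data, mask_prob=0.15, seed=0):
    """Build denoising examples from (sequence, label) pairs.

    The downstream labels are dropped on purpose: only the input
    sequences of the task itself are used for pretraining.
    """
    rng = random.Random(seed)
    return [
        make_denoising_example(seq, mask_prob, rng)
        for seq, _label in labeled_data
    ]
```

A model would first be trained to predict the `targets` from the `inputs`, and only then fine-tuned on the original `(sequence, label)` pairs. This covers the masked variant only; other denoising setups, such as next-token prediction for unidirectional models, would build their targets differently.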

Code & Models

Repositories

idoamos/not-from-scratch (PyTorch, official)

Videos

Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors · SlidesLive

Taxonomy

Topics: Machine Learning and Data Classification · Advanced Neural Network Applications · Explainable Artificial Intelligence (XAI)