General-purpose, long-context autoregressive modeling with Perceiver AR

Curtis Hawthorne; Andrew Jaegle; Cătălina Cangea; Sebastian Borgeaud; Charlie Nash; Mateusz Malinowski; Sander Dieleman; Oriol Vinyals; Matthew Botvinick; Ian Simon; Hannah Sheahan; Neil Zeghidour; Jean-Baptiste Alayrac; João Carreira; Jesse Engel

arXiv:2202.07765·cs.LG·June 15, 2022·34 cites

TL;DR

Perceiver AR is a scalable, modality-agnostic autoregressive model capable of handling over a hundred thousand tokens, enabling effective long-context density estimation for images and music with state-of-the-art results.

Contribution

The paper introduces Perceiver AR, an architecture that uses cross-attention to map long inputs onto a small number of latents while keeping end-to-end causal masking, which avoids the cost of scaling a standard Transformer to such long contexts.

Findings

01

Achieves state-of-the-art likelihood on long-sequence benchmarks.

02

Generates coherent long-term structure in images and music.

03

Attends directly to over a hundred thousand tokens, without hand-crafted sparsity patterns or memory mechanisms.

Abstract

Real-world data is high-dimensional: a book, image, or musical performance can easily contain hundreds of thousands of elements even after compression. However, the most commonly used autoregressive models, Transformers, are prohibitively expensive to scale to the number of inputs and layers needed to capture this long-range structure. We develop Perceiver AR, an autoregressive, modality-agnostic architecture which uses cross-attention to map long-range inputs to a small number of latents while also maintaining end-to-end causal masking. Perceiver AR can directly attend to over a hundred thousand tokens, enabling practical long-context density estimation without the need for hand-crafted sparsity patterns or memory mechanisms. When trained on images or music, Perceiver AR generates outputs with clear long-term coherence and structure. Our architecture also obtains state-of-the-art…
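
The causally masked cross-attention described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single head, omits all learned projections, and assumes for simplicity that each latent is aligned with one of the final input positions. The function names (`causal_cross_attention_mask`, `cross_attend`) are hypothetical.

```python
import math


def causal_cross_attention_mask(num_inputs, num_latents):
    """mask[i][j] is True when latent i may attend to input position j.

    Assumption: latent i is aligned with input position (num_inputs - num_latents + i),
    so it may only see inputs at or before that position.
    """
    assert 0 < num_latents <= num_inputs
    offset = num_inputs - num_latents
    return [[j <= offset + i for j in range(num_inputs)] for i in range(num_latents)]


def softmax(scores):
    """Numerically stable softmax; masked (-inf) scores get weight 0."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(inputs, num_latents):
    """Map M input vectors to N latent vectors with causal cross-attention.

    Illustrative only: queries are the last N inputs, keys and values are all
    M inputs, and there are no learned weights.
    """
    num_inputs, dim = len(inputs), len(inputs[0])
    queries = inputs[-num_latents:]
    mask = causal_cross_attention_mask(num_inputs, num_latents)
    latents = []
    for i, query in enumerate(queries):
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            if mask[i][j] else float("-inf")
            for j, key in enumerate(inputs)
        ]
        weights = softmax(scores)
        latents.append(
            [sum(w * value[d] for w, value in zip(weights, inputs)) for d in range(dim)]
        )
    return latents
```

The cost of this step grows with the product of the input length and the number of latents rather than with the square of the input length. Any further processing then operates only on the latents, which is what makes very long contexts practical in this kind of design.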

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Image and Signal Denoising Methods · Advanced Neural Network Applications