General-purpose, long-context autoregressive modeling with Perceiver AR
Curtis Hawthorne, Andrew Jaegle, Cătălina Cangea, Sebastian Borgeaud, Charlie Nash, Mateusz Malinowski, Sander Dieleman, Oriol Vinyals, Matthew Botvinick, Ian Simon, Hannah Sheahan, Neil Zeghidour, Jean-Baptiste Alayrac, João Carreira, Jesse Engel

TL;DR
Perceiver AR is a scalable, modality-agnostic autoregressive model capable of handling over a hundred thousand tokens, enabling effective long-context density estimation for images and music with state-of-the-art results.
Contribution
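The paper introduces Perceiver AR, an autoregressive architecture that uses cross-attention to map a long input sequence onto a small number of latents while keeping end-to-end causal masking. This avoids the cost that makes standard Transformers prohibitively expensive to scale to very long inputs.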
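To make the scaling argument concrete, here is a rough back-of-the-envelope count of attention-score entries per layer. It is only an illustrative sketch: the function name `attention_score_counts` and the choice of 1,024 latents are assumptions made for this example, not figures taken from the paper, and the count ignores heads, batch size, and feature dimensions.

```python
def attention_score_counts(seq_len: int, num_latents: int) -> dict:
    """Count attention-score entries for one single-head layer (illustrative only)."""
    return {
        # Standard self-attention: every position scores every position.
        "full_self_attention": seq_len * seq_len,
        # Cross-attention: each latent scores every input position.
        "cross_attention": num_latents * seq_len,
        # Later layers operate on the latents only.
        "latent_self_attention": num_latents * num_latents,
    }


# Hypothetical setting: 100,000 input tokens, 1,024 latents (assumed value).
counts = attention_score_counts(seq_len=100_000, num_latents=1_024)
for name, value in counts.items():
    print(f"{name}: {value:,}")
```

Under these assumed sizes, the single cross-attention step needs roughly a hundred times fewer score entries than full self-attention over the same input, and every subsequent layer depends only on the number of latents.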
Findings
Achieves state-of-the-art likelihood on long-sequence benchmarks.
Generates coherent long-term structure in images and music.
Handles over a hundred thousand tokens efficiently.
Abstract
Real-world data is high-dimensional: a book, image, or musical performance can easily contain hundreds of thousands of elements even after compression. However, the most commonly used autoregressive models, Transformers, are prohibitively expensive to scale to the number of inputs and layers needed to capture this long-range structure. We develop Perceiver AR, an autoregressive, modality-agnostic architecture which uses cross-attention to map long-range inputs to a small number of latents while also maintaining end-to-end causal masking. Perceiver AR can directly attend to over a hundred thousand tokens, enabling practical long-context density estimation without the need for hand-crafted sparsity patterns or memory mechanisms. When trained on images or music, Perceiver AR generates outputs with clear long-term coherence and structure. Our architecture also obtains state-of-the-art…
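The core idea in the abstract, cross-attending from a long input to a few latents under a causal mask, can be sketched in NumPy. This is a minimal single-head sketch, not the authors' implementation. It assumes that each latent is aligned with one of the final positions of the input and may only attend to inputs at or before that position. The function name `causal_cross_attention`, the random projection matrices, and all sizes are hypothetical.

```python
import numpy as np


def causal_cross_attention(x, num_latents, w_q, w_k, w_v):
    """Single-head causally masked cross-attention (illustrative sketch).

    x: (seq_len, dim) input embeddings.
    Assumption: latent i is aligned with input position seq_len - num_latents + i.
    """
    seq_len = x.shape[0]
    offset = seq_len - num_latents

    q = x[offset:] @ w_q  # (num_latents, dim): queries from the last positions
    k = x @ w_k           # (seq_len, dim): keys from the whole input
    v = x @ w_v           # (seq_len, dim): values from the whole input

    scores = q @ k.T / np.sqrt(q.shape[-1])  # (num_latents, seq_len)

    # Causal mask: a latent must not see inputs after its own position.
    latent_pos = offset + np.arange(num_latents)[:, None]
    input_pos = np.arange(seq_len)[None, :]
    scores = np.where(input_pos <= latent_pos, scores, -np.inf)

    # Numerically stable softmax over the input axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v  # (num_latents, dim)


# Hypothetical sizes, chosen small so the example runs instantly.
rng = np.random.default_rng(0)
seq_len, num_latents, dim = 64, 8, 16
x = rng.normal(size=(seq_len, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
latents = causal_cross_attention(x, num_latents, w_q, w_k, w_v)
print(latents.shape)
```

In this sketch the score matrix has shape `(num_latents, seq_len)` rather than `(seq_len, seq_len)`, which is where the savings on long inputs would come from. A full model would presumably follow this step with causally masked self-attention over the latents; that part is omitted here.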
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Image and Signal Denoising Methods · Advanced Neural Network Applications
