Perceiver: General Perception with Iterative Attention
Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, and Joao Carreira

TL;DR
The paper introduces the Perceiver, a scalable Transformer-based model that processes diverse high-dimensional inputs across multiple modalities using iterative attention, achieving competitive performance without domain-specific assumptions.
Contribution
It presents a novel architecture that generalizes across modalities and scales to large inputs by using an asymmetric attention mechanism and iterative input distillation.
Findings
The Perceiver matches or outperforms strong specialized models on classification tasks across images, point clouds, audio, video, and video+audio.
On ImageNet it achieves performance comparable to ResNet-50 and ViT without 2D convolutions, attending directly to about 50,000 pixels.
It is competitive on AudioSet across all input modalities.
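To make the scaling argument behind these findings concrete, the following back-of-the-envelope sketch counts attention-score entries for one layer. The 224×224 input resolution and the latent count of 512 are assumptions chosen for illustration, and `attention_score_entries` is a hypothetical helper rather than anything taken from the paper's code.

```python
def attention_score_entries(num_queries: int, num_keys: int) -> int:
    """Size of the query-key score matrix for a single attention head."""
    return num_queries * num_keys

# Assumption: a 224x224 image flattened to one input element per pixel.
num_inputs = 224 * 224          # 50,176 pixels
# Assumption: an illustrative latent array size, much smaller than the input.
num_latents = 512

# Standard Transformer self-attention over the raw pixels: inputs attend to inputs.
full_self_attention = attention_score_entries(num_inputs, num_inputs)

# Asymmetric cross-attention: latents (queries) attend to inputs (keys/values).
cross_attention = attention_score_entries(num_latents, num_inputs)

# Self-attention inside the latent bottleneck: independent of input size.
latent_self_attention = attention_score_entries(num_latents, num_latents)

print(full_self_attention)      # 2517630976
print(cross_attention)          # 25690112
print(latent_self_attention)    # 262144
print(full_self_attention // cross_attention)  # 98
```

Under these assumptions the cross-attention step is about 98 times cheaper than full self-attention over the pixels, and every subsequent latent layer costs the same regardless of how many inputs there are.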
Abstract
Biological systems perceive the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc. The perception models used in deep learning on the other hand are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models. These priors introduce helpful inductive biases, but also lock models to individual modalities. In this paper we introduce the Perceiver - a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets. The model leverages an asymmetric attention mechanism to iteratively distill inputs into a tight latent bottleneck, allowing it to scale to handle very large…
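The asymmetric attention described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it uses a single attention head, omits layer normalization, MLP blocks, and positional encodings, and reuses the same weights on every iteration. All function names, parameter names, and array sizes here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(query_source, kv_source, w_q, w_k, w_v):
    """Single-head attention: rows of query_source attend to rows of kv_source."""
    q = query_source @ w_q                      # (n_q, d)
    k = kv_source @ w_k                         # (n_kv, d)
    v = kv_source @ w_v                         # (n_kv, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])     # (n_q, n_kv)
    return softmax(scores, axis=-1) @ v         # (n_q, d)

def make_params(rng, input_dim, latent_dim):
    """Random projection weights (hypothetical initialization)."""
    def w(rows, cols):
        return rng.normal(scale=0.1, size=(rows, cols))
    return {
        # Cross-attention: queries come from latents, keys/values from inputs.
        "cross": (w(latent_dim, latent_dim), w(input_dim, latent_dim), w(input_dim, latent_dim)),
        # Latent self-attention: queries, keys, and values all come from latents.
        "self": (w(latent_dim, latent_dim), w(latent_dim, latent_dim), w(latent_dim, latent_dim)),
    }

def perceiver_sketch(inputs, latents, params, num_iterations=4):
    """Repeatedly distill a large input array into a small latent array."""
    for _ in range(num_iterations):
        # Cost scales with num_latents * num_inputs, not num_inputs ** 2.
        latents = latents + attend(latents, inputs, *params["cross"])
        # Cost scales with num_latents ** 2, independent of input size.
        latents = latents + attend(latents, latents, *params["self"])
    return latents

rng = np.random.default_rng(0)
inputs = rng.normal(size=(1000, 3))     # e.g. 1000 input elements with 3 channels
latents = rng.normal(size=(16, 8))      # small latent bottleneck
params = make_params(rng, input_dim=3, latent_dim=8)
out = perceiver_sketch(inputs, latents, params)
print(out.shape)  # (16, 8)
```

The design point this illustrates is that only the cross-attention step ever touches the full input, so the depth of the latent stack can grow without the cost depending on input size.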
Code & Models
- 🤗 krasserm/perceiver-io-img-clf-mnist · model · 4 dl
- 🤗 HuggingFaceM4/idefics2-8b-base · model · 1.6k dl · ♡ 28
- 🤗 HuggingFaceM4/idefics2-8b · model · 157k dl · ♡ 620
- 🤗 HuggingFaceM4/idefics2-8b-chatty · model · 70 dl · ♡ 95
- 🤗 Trelis/idefics2-8b-chatty-bf16 · model · 8 dl · ♡ 1
- 🤗 huz-relay/idefics2-8b-ocr · model · 8 dl · ♡ 1
- 🤗 peterpeter8585/ai2 · model · 1 dl
Taxonomy
Topics: Neural dynamics and brain function · Cell Image Analysis Techniques · Anomaly Detection Techniques and Applications
