Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning
Jannik Kossen, Neil Band, Clare Lyle, Aidan N. Gomez, Tom Rainforth, Yarin Gal

TL;DR
This paper introduces a deep learning architecture that applies self-attention across the entire dataset to explicitly model relationships between datapoints. It solves cross-datapoint lookup and reasoning tasks that traditional deep learning models cannot, and gives highly competitive results on tabular data.
Contribution
It presents a novel dataset-wide self-attention model that learns to utilize relationships between data points, extending deep learning beyond individual input-output pairs.
Findings
Successfully solves cross-datapoint lookup tasks
Achieves competitive results on tabular data
Provides insights into data point interactions
Abstract
We challenge a common assumption underlying most supervised deep learning: that a model makes a prediction depending only on its parameters and the features of a single input. To this end, we introduce a general-purpose deep learning architecture that takes as input the entire dataset instead of processing one datapoint at a time. Our approach uses self-attention to reason about relationships between datapoints explicitly, which can be seen as realizing non-parametric models using parametric attention mechanisms. However, unlike conventional non-parametric models, we let the model learn end-to-end from the data how to make use of other datapoints for prediction. Empirically, our models solve cross-datapoint lookup and complex reasoning tasks unsolvable by traditional deep learning models. We show highly competitive results on tabular data, early results on CIFAR-10, and give insight…
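The core idea in the abstract, attention computed across the rows of a dataset rather than within a single input, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a single attention head, randomly initialised projection matrices (`Wq`, `Wk`, `Wv` are hypothetical names), and omits the residual connections, layer normalization, and dropout listed in the taxonomy. The function name `attention_between_datapoints` is also chosen here for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_between_datapoints(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over the rows of X.

    X has shape (n, d): one row per datapoint, so the "sequence" axis
    is the dataset itself. Returns the updated representations (n, d_v)
    and the (n, n) matrix of attention weights between datapoints.
    """
    Q = X @ Wq                                # queries, shape (n, d_k)
    K = X @ Wk                                # keys,    shape (n, d_k)
    V = X @ Wv                                # values,  shape (n, d_v)
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # pairwise datapoint scores
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights


# Toy dataset: 6 datapoints with 4 features each (illustrative sizes only).
rng = np.random.default_rng(0)
n, d, d_k = 6, 4, 8
X = rng.normal(size=(n, d))
Wq = rng.normal(size=(d, d_k))
Wk = rng.normal(size=(d, d_k))
Wv = rng.normal(size=(d, d_k))

out, weights = attention_between_datapoints(X, Wq, Wk, Wv)
```

Because no positional information is added, reordering the rows of `X` simply reorders the rows of `out` in the same way, which is the behaviour one would want when the input is an unordered set of datapoints.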
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI)
Methods: Linear Layer · L1 Regularization · Embedding Dropout · Attention Dropout · Residual Connection · Attention Is All You Need · Dropout · Softmax · Layer Normalization · Dense Connections
