Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning

Jannik Kossen; Neil Band; Clare Lyle; Aidan N. Gomez; Tom Rainforth; Yarin Gal

arXiv:2106.02584·cs.LG·February 2, 2022·43 cites

Open Access 3 Repos 3 Videos

TL;DR

This paper introduces a deep learning architecture that takes the entire dataset as input and uses self-attention to model relationships between datapoints explicitly, solving cross-datapoint lookup and reasoning tasks and giving highly competitive results on tabular data.

Contribution

It presents a novel dataset-wide self-attention model that learns to utilize relationships between data points, extending deep learning beyond individual input-output pairs.

Findings

01 Successfully solves cross-datapoint lookup tasks

02 Achieves competitive results on tabular data

03 Provides insights into datapoint interactions

Abstract

We challenge a common assumption underlying most supervised deep learning: that a model makes a prediction depending only on its parameters and the features of a single input. To this end, we introduce a general-purpose deep learning architecture that takes as input the entire dataset instead of processing one datapoint at a time. Our approach uses self-attention to reason about relationships between datapoints explicitly, which can be seen as realizing non-parametric models using parametric attention mechanisms. However, unlike conventional non-parametric models, we let the model learn end-to-end from the data how to make use of other datapoints for prediction. Empirically, our models solve cross-datapoint lookup and complex reasoning tasks unsolvable by traditional deep learning models. We show highly competitive results on tabular data, early results on CIFAR-10, and give insight…
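
The core idea in the abstract, attention computed across datapoints rather than within a single input, can be illustrated with a minimal NumPy sketch. This is not the authors' architecture: it is a single attention head over the rows of a toy dataset, and the function name `attention_between_datapoints`, the weight matrices, and all sizes are assumptions chosen for illustration.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row maximum for numerical stability.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention_between_datapoints(X, W_q, W_k, W_v):
    """Single-head self-attention where each row of X is one datapoint.

    X has shape (n_datapoints, n_features). The attention matrix has
    shape (n_datapoints, n_datapoints), so every datapoint's output is
    a learned mixture over all datapoints in the dataset.
    """
    Q = X @ W_q  # queries, one per datapoint
    K = X @ W_k  # keys, one per datapoint
    V = X @ W_v  # values, one per datapoint
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

# Toy dataset (hypothetical sizes): 6 datapoints, 4 features, hidden size 8.
rng = np.random.default_rng(0)
n, d, h = 6, 4, 8
X = rng.normal(size=(n, d))
W_q = rng.normal(size=(d, h))
W_k = rng.normal(size=(d, h))
W_v = rng.normal(size=(d, h))

out, weights = attention_between_datapoints(X, W_q, W_k, W_v)

# Perturbing datapoint 3 changes the representation of datapoint 0,
# which a model that processes one input at a time cannot do.
X_perturbed = X.copy()
X_perturbed[3] += 1.0
out_perturbed, _ = attention_between_datapoints(X_perturbed, W_q, W_k, W_v)
```

In this sketch the weights are random; in a trained model they would be learned end-to-end, which is what lets the model decide how to use other datapoints for prediction.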

Code & Models

Videos

Non-Parametric Transformers | Paper explained · YouTube

Stanford CS25: V1 I Self Attention and Non-parametric transformers (NPTs) · YouTube

Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning · SlidesLive

Taxonomy

Topics: Anomaly Detection Techniques and Applications · Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI)

Methods: Linear Layer · L1 Regularization · Embedding Dropout · Attention Dropout · Residual Connection · Attention Is All You Need · Dropout · Softmax · Layer Normalization · Dense Connections