Vision Transformers Need Registers

Timothée Darcet; Maxime Oquab; Julien Mairal; Piotr Bojanowski

arXiv:2309.16588·cs.CV·April 15, 2024·51 cites

TL;DR

This paper identifies artifacts in Vision Transformer feature maps caused by high-norm tokens in background areas and proposes adding tokens to the input sequence to mitigate this issue, improving model performance and interpretability.
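As a rough illustration of what "high-norm tokens" means, the sketch below flags patch tokens whose L2 norm exceeds a cutoff. The function names, the toy token values, and the threshold of 50.0 are hypothetical choices for this example; they are not taken from the paper.

```python
import math

def l2_norm(vec):
    """Euclidean norm of one token embedding."""
    return math.sqrt(sum(x * x for x in vec))

def find_high_norm_tokens(patch_tokens, threshold):
    """Return indices of patch tokens whose L2 norm exceeds `threshold`.

    `threshold` is a hypothetical cutoff chosen by the caller.
    """
    return [i for i, tok in enumerate(patch_tokens) if l2_norm(tok) > threshold]

# Toy 3-dimensional "patch tokens"; real ViT embeddings are much wider.
patch_tokens = [
    [1.0, 2.0, 2.0],    # norm 3
    [0.0, 3.0, 4.0],    # norm 5
    [60.0, 80.0, 0.0],  # norm 100, the outlier
    [2.0, 1.0, 2.0],    # norm 3
]
outliers = find_high_norm_tokens(patch_tokens, threshold=50.0)
print(outliers)
```

In practice the cutoff would have to be chosen by inspecting the norm distribution of the model at hand.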

Contribution

It introduces a simple token addition method that eliminates artifacts in ViT feature maps, enhancing dense visual prediction and object discovery capabilities.

Findings

01 Fixes artifacts in ViT feature maps

02 Sets a new state of the art for self-supervised models on dense visual prediction tasks

03 Enables smoother attention and feature maps

Abstract

Transformers have recently emerged as a powerful tool for learning visual representations. In this paper, we identify and characterize artifacts in feature maps of both supervised and self-supervised ViT networks. The artifacts correspond to high-norm tokens appearing during inference primarily in low-informative background areas of images, that are repurposed for internal computations. We propose a simple yet effective solution based on providing additional tokens to the input sequence of the Vision Transformer to fill that role. We show that this solution fixes that problem entirely for both supervised and self-supervised models, sets a new state of the art for self-supervised visual models on dense visual prediction tasks, enables object discovery methods with larger models, and most importantly leads to smoother feature maps and attention maps for downstream visual processing.
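The idea of "providing additional tokens to the input sequence" can be sketched at the sequence level as follows. This is a minimal pure-Python illustration, not the authors' implementation: the helper names, the ordering `[CLS] + registers + patches`, the step of dropping register outputs, and the identity stand-in for the transformer blocks are all assumptions made for the example.

```python
import random

def make_token(dim, rng):
    """Hypothetical helper: a small random vector standing in for an embedding."""
    return [rng.gauss(0.0, 0.02) for _ in range(dim)]

def add_registers(patch_tokens, cls_token, register_tokens):
    """Build the input sequence (assumed order: [CLS], registers, patches)."""
    return [cls_token] + list(register_tokens) + list(patch_tokens)

def split_outputs(output_tokens, num_registers):
    """Split the output sequence, dropping the register outputs (assumption)."""
    cls_out = output_tokens[0]
    patch_out = output_tokens[1 + num_registers:]
    return cls_out, patch_out

rng = random.Random(0)
dim, num_patches, num_registers = 8, 16, 4

cls_token = make_token(dim, rng)
# In a trained model these would be learned parameters, not random draws.
register_tokens = [make_token(dim, rng) for _ in range(num_registers)]
patch_tokens = [make_token(dim, rng) for _ in range(num_patches)]

sequence = add_registers(patch_tokens, cls_token, register_tokens)
outputs = sequence  # identity stand-in for the transformer blocks
cls_out, patch_out = split_outputs(outputs, num_registers)
print(len(sequence), len(patch_out))
```

The point of the sketch is only the bookkeeping: the sequence grows by `num_registers` tokens on the way in, and the patch outputs used downstream keep their original count.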

Peer Reviews

Decision·ICLR 2024 oral

Reviewer 01 · Rating 8 (accept, good paper) · Confidence 3

Strengths

- The paper identifies an interesting phenomenon observed in popular transformer models (DINO). By removing this artifact, the authors demonstrate that the improved models have clear attention maps that could be used for downstream analysis such as object localization.
- The step-by-step investigation is solid and compelling.
- The method of providing a junkyard to remove the artifact is novel and effective.
- The experiments are convincing and comprehensive.

Weaknesses

- The norm shows a significant reduction for OpenCLIP in Figure 7, yet Table 3 does not show a significant improvement for object localization, which is the main benefit of using registers. Further explanation or exploration of the reason behind this would help wider adoption.
- Minor: it should be OpenCLIP instead of CLIP in Figure 7.

Reviewer 02 · Rating 8 (accept, good paper) · Confidence 4

Strengths

1. The paper identifies an important problem of heatmaps lacking spatial resolution and accuracy in DINOv2 and other ViTs which leads to suboptimal downstream performance on object discovery and localization tasks. The fact that this is not a problem for DINOv1 is pretty surprising and the experiments done on the changes in token norm across model size will be a useful start to understand this better. The discovery of these high-norm tokens, experiments using the linear models and classifiers to

Weaknesses

1. The removal of these artifacts does come at the cost of new tokens, hence additional compute. The paper reports a 2-6% increase when adding 4-16 new register tokens.
2. One very interesting observation was how the different register tokens end up focussing on the different areas of interest on the object. If there are spatially discrete areas of focus for the registers, does this undermine the argument that we need them for storing global information which was earlier being done using redunda

Reviewer 03 · Rating 8 (accept, good paper) · Confidence 4

Strengths

1. The investigation is quite original; the use of memory/registers in transformers is not necessarily a new idea, but motivating them through removing redundancy and reducing attention artifacts is both novel and interesting.
2. Experiments and analysis are mostly convincing (see questions below).
3. I enjoyed the narrative exposition: the problem setting is clear, the motivation for registers is clear, and their utility is well-demonstrated via experiments.

Weaknesses

1. While adding additional tokens (registers) seems like a simple and efficacious approach, I'm wondering if it's the only possible solution for reducing patch-level redundancy. Did the authors observe similar effects across other self-supervised models, like MAE, where nominally the patch-level reconstruction should also alleviate representational redundancy?
2. In demonstrating that the artifacts hold global information, the authors "choose a single token at random, either high-norm or normal,

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques

Methods: Linear Layer · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Absolute Position Encodings · Dense Connections · Layer Normalization · Vision Transformer · Multi-Head Attention · Byte Pair Encoding