Vision Transformers Need Registers
Timothée Darcet, Maxime Oquab, Julien Mairal, Piotr Bojanowski

TL;DR
This paper identifies artifacts in Vision Transformer feature maps caused by high-norm tokens in background areas and proposes adding tokens to the input sequence to mitigate this issue, improving model performance and interpretability.
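The high-norm artifact described above can be illustrated by ranking patch tokens by their L2 norm. The following is a minimal sketch in Python with NumPy, run on synthetic tokens rather than real ViT outputs. The function name `flag_high_norm_tokens`, the median-plus-MAD threshold, and the `num_mads` parameter are illustrative assumptions, not the paper's detection criterion.

```python
import numpy as np

def flag_high_norm_tokens(patch_tokens, num_mads=10.0):
    """Return (norms, mask); mask marks tokens whose L2 norm is an outlier.

    patch_tokens: array of shape (num_patches, dim).
    The median + k * MAD rule is an illustrative choice (an assumption),
    not the criterion used in the paper.
    """
    norms = np.linalg.norm(patch_tokens, axis=-1)
    median = np.median(norms)
    mad = np.median(np.abs(norms - median))  # median absolute deviation
    threshold = median + num_mads * mad
    return norms, norms > threshold

# Synthetic stand-in for the output patch tokens of a ViT (16 x 16 grid).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(256, 64))
outlier_idx = [3, 100, 250]
tokens[outlier_idx] *= 10.0  # inject a few artificially large tokens

norms, mask = flag_high_norm_tokens(tokens)
flagged = np.flatnonzero(mask).tolist()
grid_mask = mask.reshape(16, 16)  # spatial map of flagged positions
print("flagged token indices:", flagged)
```

On a real model, `tokens` would be the patch embeddings at the output of the network, and `grid_mask` would show where in the image the outliers sit.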
Contribution
It introduces a simple token addition method that eliminates artifacts in ViT feature maps, enhancing dense visual prediction and object discovery capabilities.
Findings
Fixes artifacts in ViT feature maps
Sets a new state of the art for self-supervised visual models on dense visual prediction tasks
Enables smoother attention and feature maps
Abstract
Transformers have recently emerged as a powerful tool for learning visual representations. In this paper, we identify and characterize artifacts in feature maps of both supervised and self-supervised ViT networks. The artifacts correspond to high-norm tokens appearing during inference primarily in low-informative background areas of images, that are repurposed for internal computations. We propose a simple yet effective solution based on providing additional tokens to the input sequence of the Vision Transformer to fill that role. We show that this solution fixes that problem entirely for both supervised and self-supervised models, sets a new state of the art for self-supervised visual models on dense visual prediction tasks, enables object discovery methods with larger models, and most importantly leads to smoother feature maps and attention maps for downstream visual processing.
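The proposed fix, extra tokens added to the input sequence and ignored at the output, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the layers are reduced to single-head self-attention with a residual connection, the register values are random stand-ins for what would be learned parameters, and the token order ([CLS], registers, patches) and the helper names `self_attention` and `forward_with_registers` are hypothetical choices made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with a residual: (seq, dim) -> (seq, dim)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = softmax(q @ k.T / np.sqrt(x.shape[-1]))
    return x + attn @ v

def forward_with_registers(patch_tokens, cls_token, registers, weights):
    """Run [CLS] + registers + patches through the layers, then drop registers."""
    num_reg = registers.shape[0]
    x = np.concatenate([cls_token, registers, patch_tokens], axis=0)
    for w_q, w_k, w_v in weights:
        x = self_attention(x, w_q, w_k, w_v)
    cls_out = x[0]                # global image representation
    patch_out = x[1 + num_reg:]   # register outputs are discarded
    return cls_out, patch_out

rng = np.random.default_rng(0)
dim, num_patches, num_reg, depth = 32, 256, 4, 2
patch_tokens = rng.normal(size=(num_patches, dim))
cls_token = rng.normal(size=(1, dim))
registers = rng.normal(size=(num_reg, dim))  # learned in a real model
weights = [
    tuple(rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))
    for _ in range(depth)
]

cls_out, patch_out = forward_with_registers(patch_tokens, cls_token, registers, weights)
print(cls_out.shape, patch_out.shape)
```

The registers take part in attention at every layer, so the model has a place for global computation other than the patch tokens, while the returned patch features keep the same shape as they would without registers.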
Peer Reviews
Decision: ICLR 2024 oral
- The paper identifies an interesting phenomenon observed in popular transformer models (DINO). By removing this artifact, the authors demonstrate that the improved models have clear attention maps that can be used for downstream analysis such as object localization.
- The step-by-step investigation is solid and compelling.
- The method of providing a junkyard to remove the artifact is novel and effective.
- The experiments are convincing and comprehensive.
- The norm shows a significant reduction for OpenCLIP in Figure 7, yet Table 3 does not show a significant improvement in object localization, which is the main benefit of using registers. Further explanation or exploration of the reason behind this would help wider adoption.
- Minor: it should be OpenCLIP instead of CLIP in Figure 7.
1. The paper identifies an important problem of heatmaps lacking spatial resolution and accuracy in DINOv2 and other ViTs which leads to suboptimal downstream performance on object discovery and localization tasks. The fact that this is not a problem for DINOv1 is pretty surprising and the experiments done on the changes in token norm across model size will be a useful start to understand this better. The discovery of these high-norm tokens, experiments using the linear models and classifiers to
1. The removal of these artifacts does come at the cost of new tokens, hence additional compute. The paper reports a 2-6% increase when adding 4-16 new register tokens.
2. One very interesting observation was how the different register tokens end up focussing on the different areas of interest on the object. If there are spatially discrete areas of focus for the registers, does this undermine the argument that we need them for storing global information which was earlier being done using redunda
1. The investigation is quite original; the use of memory/registers in transformers is not necessarily a new idea, but motivating them through removing redundancy and reducing attention artifacts is both novel and interesting.
2. Experiments and analysis are mostly convincing (see questions below).
3. I enjoyed the narrative exposition: the problem setting is clear, the motivation for registers is clear, and their utility is well-demonstrated via experiments.
1. While adding additional tokens (registers) seems like a simple and efficacious approach, I'm wondering if it's the only possible solution for reducing patch-level redundancy. Did the authors observe similar effects across other self-supervised models, like MAE, where nominally the patch-level reconstruction should also alleviate representational redundancy?
2. In demonstrating that the artifacts hold global information, the authors "choose a single token at random, either high-norm or normal,
Code & Models
- 🤗 facebook/dinov2-with-registers-large · model · 113k dl · ♡ 12
- 🤗 timm/vit_base_patch14_reg4_dinov2.lvd142m · model · 162k dl · ♡ 14
- 🤗 timm/vit_giant_patch14_reg4_dinov2.lvd142m · model · 376 dl · ♡ 1
- 🤗 timm/vit_large_patch14_reg4_dinov2.lvd142m · model · 265k dl · ♡ 7
- 🤗 timm/vit_small_patch14_reg4_dinov2.lvd142m · model · 410k dl · ♡ 7
- 🤗 timm/vit_base_patch16_rope_reg1_gap_256.sbb_in1k · model · 596 dl · ♡ 4
- 🤗 timm/vit_betwixt_patch16_reg1_gap_256.sbb_in1k · model · 131 dl
- 🤗 timm/vit_betwixt_patch16_reg4_gap_256.sbb_in1k · model · 55 dl
- 🤗 timm/vit_betwixt_patch16_reg4_gap_256.sbb_in12k · model · 31 dl · ♡ 1
- 🤗 timm/vit_betwixt_patch16_reg4_gap_256.sbb_in12k_ft_in1k · model · 41 dl · ♡ 1
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
Methods: Linear Layer · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Absolute Position Encodings · Dense Connections · Layer Normalization · Vision Transformer · Multi-Head Attention · Byte Pair Encoding
