Leveraging Registers in Vision Transformers for Robust Adaptation
Srikar Yellapragada, Kowshik Thopalli, Vivek Narayanaswamy, Wesam Sakla, Yang Liu, Yamen Mubarka, Dimitris Samaras, Jayaraman J. Thiagarajan

TL;DR
This paper introduces a method using register tokens in Vision Transformers to improve out-of-distribution generalization and anomaly detection without extra computational cost.
Contribution
It proposes a simple technique that combines CLS and register embeddings to improve ViT robustness in OOD scenarios, an area that remains relatively unexplored.
Findings
2-4% improvement in OOD accuracy
2-3% reduction in false positive rates
Maintains in-distribution performance
Abstract
Vision Transformers (ViTs) have shown success across a variety of tasks due to their ability to capture global image representations. Recent studies have identified the existence of high-norm tokens in ViTs, which can interfere with unsupervised object discovery. To address this, "registers" have been proposed: additional tokens that isolate high-norm patch tokens while capturing global image-level information. While registers have been studied extensively for object discovery, their generalization properties, particularly in out-of-distribution (OOD) scenarios, remain underexplored. In this paper, we examine the utility of register token embeddings in providing additional features for improving generalization and anomaly rejection. To that end, we propose a simple method that combines the special CLS token embedding commonly employed in ViTs with the average-pooled…
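The feature construction described in the abstract can be illustrated with a minimal sketch. Several details here are assumptions rather than facts from the paper: the token ordering `[CLS, registers, patches]`, the function name `combine_cls_and_registers`, and above all the use of concatenation to combine the two embeddings, since the abstract text available here is cut off before it says how they are combined. Plain Python lists stand in for the tensors a real ViT would return.

```python
from statistics import fmean


def combine_cls_and_registers(tokens, num_registers):
    """Build one feature vector from a ViT's output tokens.

    tokens: list of token embeddings (each a list of floats), ASSUMED to be
            ordered as [CLS, register_1..register_R, patch tokens...].
    num_registers: number of register tokens R (must be >= 1).
    """
    if num_registers < 1 or len(tokens) < 1 + num_registers:
        raise ValueError("expected one CLS token followed by the register tokens")

    cls_embedding = tokens[0]
    register_embeddings = tokens[1:1 + num_registers]

    # Average-pool the registers: mean over the token axis, per dimension.
    pooled_registers = [fmean(dim) for dim in zip(*register_embeddings)]

    # ASSUMPTION: concatenation is used only for illustration; the summary
    # above does not state how the paper combines the two embeddings.
    return list(cls_embedding) + pooled_registers


# Toy example: embedding dimension 2, two registers, one patch token.
tokens = [
    [1.0, 2.0],  # CLS
    [0.0, 4.0],  # register 1
    [2.0, 0.0],  # register 2
    [9.0, 9.0],  # patch token (not used in the feature)
]
feature = combine_cls_and_registers(tokens, num_registers=2)
print(feature)  # [1.0, 2.0, 1.0, 2.0]
```

Under these assumptions the combined feature is twice the embedding width and reuses tokens the model already computes, so no additional forward pass is needed.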
Taxonomy
Topics: Advanced Vision and Imaging · 3D Surveying and Cultural Heritage · Industrial Vision Systems and Defect Detection
