Vision Transformers Need More Than Registers

Cheng Shi; Yizhou Yu; Sibei Yang

arXiv:2602.22394·cs.CV·April 15, 2026

TL;DR

This paper analyzes artifacts in Vision Transformers (ViTs), attributes them to background shortcuts driven by global attention and coarse-grained semantic supervision, and proposes selectively integrating patch features into the CLS token to improve performance.

Contribution

The paper systematically traces the origin of artifacts in ViTs to a lazy aggregation behavior and introduces a targeted fix that improves performance across 12 benchmarks.

Findings

01

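Artifacts in ViTs stem from lazy aggregation: semantically irrelevant background patches serve as shortcuts for global semantics, driven by global attention and coarse-grained semantic supervision.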

02

Selectively integrating patch features into the CLS token reduces the influence of background-dominated shortcuts and improves performance.

03

The method improves results across 12 benchmarks under label-, text-, and self-supervision.
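One way to make the first finding concrete is to measure how much of the CLS token's attention lands on background patches. The sketch below is only an illustrative diagnostic, not the paper's analysis procedure: the function name `background_attention_share`, the `cls_attention` weights, and the `is_background` mask are all hypothetical inputs that a reader would have to extract from their own model and annotations.

```python
def background_attention_share(cls_attention, is_background):
    """Return the fraction of CLS attention mass that falls on background patches.

    cls_attention: non-negative attention weights from the CLS token to each patch
                   (hypothetical input, e.g. one head of the last layer).
    is_background: booleans of the same length, True where a patch is background
                   (hypothetical mask, e.g. derived from a segmentation label).
    """
    if len(cls_attention) != len(is_background):
        raise ValueError("attention weights and mask must have the same length")
    total = sum(cls_attention)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    # Sum only the weights assigned to patches flagged as background.
    background_mass = sum(w for w, bg in zip(cls_attention, is_background) if bg)
    return background_mass / total


if __name__ == "__main__":
    weights = [0.1, 0.2, 0.3, 0.4]
    mask = [True, False, False, True]
    print(background_attention_share(weights, mask))
```

A share close to 1 would suggest that the global representation leans heavily on background patches, which is the kind of shortcut the findings describe.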

Abstract

Vision Transformers (ViTs), when pre-trained on large-scale data, provide general-purpose representations for diverse downstream tasks. However, artifacts in ViTs are widely observed across different supervision paradigms and downstream tasks, and their fundamental mechanisms have yet to be sufficiently elucidated. In this paper, through systematic analysis, we conclude that these artifacts originate from a lazy aggregation behavior: the ViT uses semantically irrelevant background patches as shortcuts to represent global semantics, driven by global attention and coarse-grained semantic supervision. Our solution selectively integrates patch features into the CLS token, reducing the influence of background-dominated shortcuts and consistently improving performance across 12 benchmarks under label-, text-, and self-supervision. We hope…
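The abstract does not say how patches are selected, so the following is only a minimal sketch of what "selectively integrating patch features into the CLS token" could look like, under stated assumptions. The per-patch relevance `scores`, the `keep_ratio` and `alpha` parameters, and the function name `integrate_selected_patches` are all hypothetical; the top-k selection and mean pooling are stand-ins chosen for illustration, not the authors' method.

```python
def integrate_selected_patches(cls_token, patches, scores, keep_ratio=0.5, alpha=0.5):
    """Blend the CLS token with the mean of the highest-scoring patch features.

    cls_token:  list of floats, the CLS feature vector.
    patches:    list of patch feature vectors, each the same length as cls_token.
    scores:     one relevance score per patch (assumed given; higher = more foreground).
    keep_ratio: fraction of patches to keep (assumed hyperparameter).
    alpha:      weight of the pooled patch feature in the blend (assumed hyperparameter).
    """
    if len(patches) != len(scores):
        raise ValueError("need exactly one score per patch")
    if not patches:
        return list(cls_token)
    # Keep at least one patch, and at most keep_ratio of them.
    k = max(1, int(len(patches) * keep_ratio))
    # Indices of the k patches with the highest relevance scores.
    top = sorted(range(len(patches)), key=lambda i: scores[i], reverse=True)[:k]
    # Mean-pool the selected patches, dimension by dimension.
    pooled = [sum(patches[i][d] for i in top) / k for d in range(len(cls_token))]
    # Convex combination of the original CLS token and the pooled feature.
    return [(1 - alpha) * c + alpha * p for c, p in zip(cls_token, pooled)]


if __name__ == "__main__":
    cls = [0.0, 0.0]
    feats = [[2.0, 0.0], [0.0, 2.0], [4.0, 0.0], [0.0, 8.0]]
    relevance = [0.9, 0.1, 0.8, 0.2]
    print(integrate_selected_patches(cls, feats, relevance))
```

In this toy call, the two low-scoring patches never reach the output, so the result is shaped only by the patches the scores mark as relevant. How such scores would be obtained is left open here.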
