UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register

Congpei Qiu; Zhaoyu Hu; Wei Ke; Zhuotao Tian; Yanhao Wu; Tong Zhang

arXiv:2605.19622 · cs.CV · May 20, 2026

TL;DR

UniRefiner is a universal framework that refines pre-trained Vision Transformers by teaching them to self-dispose of spurious tokens, significantly improving their performance on spatially sensitive tasks.

Contribution

The paper presents a comprehensive diagnosis of spurious tokens and proposes a contrastive-register-based method that refines ViTs without extensive retraining.

Findings

01. Refined EVA-CLIP-8B achieves 51.9% mIoU on ADE20K (+9.4%).

02. Zero-shot segmentation accuracy improves by up to 22%.

03. The method requires only a few epochs on ~5k images for effective refinement.
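
For readers unfamiliar with the mIoU metric cited in the first finding, here is a minimal sketch of how mean intersection-over-union can be computed for a single predicted label map. This is not the paper's evaluation code: the function name `mean_iou` and the choice to skip classes absent from both maps are assumptions made for illustration, and benchmark tooling for ADE20K typically accumulates intersections and unions over the whole dataset rather than per image.

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Mean IoU over classes for one pair of integer label maps (illustrative)."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            # Assumption: a class missing from both maps is ignored.
            continue
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    if not ious:
        return float("nan")
    return float(np.mean(ious))

# Toy 2x2 example with two classes.
pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
score = mean_iou(pred, target, num_classes=2)
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.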

Abstract

Representation learning with Vision Transformers (ViTs) has advanced rapidly, yet the utility of large-scale models in spatially sensitive tasks is hindered by spurious tokens. Prior efforts to mitigate this have been limited, often defining these artifacts narrowly, for example, as simple high-norm outliers. We argue that this scope is insufficient. For dense prediction tasks, we posit that any token failing to encode location-aligned semantics should be treated as a spurious artifact. This broader definition reveals a more complex problem, leading us to systematically categorize and characterize three fundamental types of spurious tokens that corrupt spatial representations. Based on this comprehensive diagnosis, we propose UniRefiner, a universal refinement framework that teaches pre-trained ViTs to self-dispose of these artifacts. UniRefiner uses contrastive registers to explicitly…
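
To make the narrow prior definition mentioned in the abstract concrete, the sketch below flags patch tokens whose L2 norm is an outlier. It illustrates only the "high-norm outlier" view that the abstract argues is insufficient. It is not UniRefiner or its contrastive registers. The function name `flag_high_norm_tokens`, the median-plus-MAD threshold, and the default `k = 3.0` are all assumptions chosen for illustration.

```python
import numpy as np

def flag_high_norm_tokens(tokens: np.ndarray, k: float = 3.0) -> np.ndarray:
    """Return a boolean mask marking tokens with unusually large L2 norm.

    tokens: array of shape (num_tokens, dim), e.g. the patch tokens of one image.
    Threshold (an assumption, not from the paper): median + k * 1.4826 * MAD.
    """
    norms = np.linalg.norm(tokens, axis=1)
    median = np.median(norms)
    # Median absolute deviation; 1.4826 rescales it to match a Gaussian std.
    mad = np.median(np.abs(norms - median))
    threshold = median + k * 1.4826 * mad
    return norms > threshold

# Toy input: eight tokens along one axis, so each norm equals its scale.
scales = np.array([1.0, 1.1, 0.9, 1.05, 0.95, 30.0, 1.0, 1.02])
tokens = np.outer(scales, np.array([1.0, 0.0, 0.0, 0.0]))
mask = flag_high_norm_tokens(tokens)
```

Here only the token with scale 30.0 is flagged. Under the broader definition in the abstract, a token with an ordinary norm could still count as spurious if it fails to encode location-aligned semantics, which a norm test like this cannot detect.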
