UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register
Congpei Qiu, Zhaoyu Hu, Wei Ke, Zhuotao Tian, Yanhao Wu, Tong Zhang

TL;DR
UniRefiner is a universal framework that refines pre-trained Vision Transformers by teaching them to self-dispose of spurious tokens, significantly improving their performance on spatially sensitive tasks.
Contribution
It introduces a comprehensive diagnosis of spurious tokens and proposes a contrastive register-based method to refine ViTs without extensive retraining.
Findings
Refined EVA-CLIP-8B achieves 51.9% mIoU on ADE20K (+9.4%)
Zero-shot segmentation accuracy improves by up to 22%
Method requires only a few epochs on ~5k images for effective refinement
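For readers unfamiliar with the metric behind the ADE20K number above, the following is a minimal sketch of mean intersection-over-union (mIoU) on flattened label sequences. The function name `mean_iou` and its parameters are illustrative assumptions, not the paper's evaluation code; benchmark toolkits typically accumulate intersections and unions over the whole dataset before averaging, which this toy version does not attempt.

```python
def mean_iou(pred, target, num_classes, ignore_index=None):
    """Mean IoU over the classes that appear in pred or target.

    pred, target: equal-length sequences of integer class ids (flattened masks).
    ignore_index: hypothetical label id to skip (e.g. an "unlabeled" class).
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            # Pixel belongs to both the predicted and the true region of class t.
            intersection[t] += 1
            union[t] += 1
        else:
            # Pixel extends the union of the predicted class and of the true class.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
print(round(mean_iou(pred, target, num_classes=2), 4))  # 0.5833
```

In this toy case class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.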
Abstract
Representation learning with Vision Transformers (ViTs) has advanced rapidly, yet the utility of large-scale models in spatially sensitive tasks is hindered by spurious tokens. Prior efforts to mitigate this have been limited, often defining these artifacts narrowly, for example, as simple high-norm outliers. We argue that this scope is insufficient. For dense prediction tasks, we posit that any token failing to encode location-aligned semantics should be treated as a spurious artifact. This broader definition reveals a more complex problem, leading us to systematically categorize and characterize three fundamental types of spurious tokens that corrupt spatial representations. Based on this comprehensive diagnosis, we propose UniRefiner, a universal refinement framework that teaches pre-trained ViTs to self-dispose of these artifacts. UniRefiner uses contrastive registers to explicitly…
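To make the "simple high-norm outliers" definition concrete, here is a minimal sketch of the narrow criterion the abstract attributes to prior work and argues is insufficient. It is not UniRefiner's method. The function name `flag_high_norm_tokens`, the median-based threshold, and the `ratio` parameter are all assumptions chosen for illustration.

```python
import math
import statistics


def flag_high_norm_tokens(tokens, ratio=3.0):
    """Return indices of tokens whose L2 norm exceeds ratio * median norm.

    tokens: list of feature vectors (one list of floats per patch token).
    ratio: hypothetical cutoff; the source does not specify any threshold.
    """
    norms = [math.sqrt(sum(x * x for x in tok)) for tok in tokens]
    threshold = ratio * statistics.median(norms)
    return [i for i, n in enumerate(norms) if n > threshold]


# Four ordinary tokens and one with a much larger norm (index 4).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.9, 0.1], [30.0, 40.0]]
print(flag_high_norm_tokens(tokens))  # [4]
```

A norm-only test like this cannot flag a token whose norm is ordinary but whose content is not aligned with its location, which is the gap the broader definition in the abstract is meant to cover.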
