ViT-5: Vision Transformers for The Mid-2020s

Feng Wang; Sucheng Ren; Tiezheng Zhang; Predrag Neskovic; Anand Bhattad; Cihang Xie; Alan Yuille

arXiv:2602.08071·cs.CV·February 10, 2026

TL;DR

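This paper introduces ViT-5, a modernized Vision Transformer that keeps the canonical Attention-FFN structure while refining normalization, activation functions, positional encoding, gating mechanisms, and learnable tokens. It outperforms prior plain Vision Transformers on both classification and generative tasks while remaining simple and transferable.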

Contribution

ViT-5 systematically modernizes the Vision Transformer backbone through component-wise refinements, outperforming state-of-the-art plain Vision Transformers on both understanding and generation benchmarks.

Findings

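01. ViT-5-Base achieves 84.2% top-1 accuracy on ImageNet-1k under comparable compute, exceeding DeiT-III-Base at 83.8%.

02. Used as the backbone in an SiT diffusion framework, ViT-5 achieves 1.84 FID, compared with 2.06 for the baseline.

03. ViT-5 demonstrates enhanced representation learning and spatial reasoning.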

Abstract

This work presents a systematic investigation into modernizing Vision Transformer backbones by leveraging architectural advancements from the past five years. While preserving the canonical Attention-FFN structure, we conduct a component-wise refinement involving normalization, activation functions, positional encoding, gating mechanisms, and learnable tokens. These updates form a new generation of Vision Transformers, which we call ViT-5. Extensive experiments demonstrate that ViT-5 consistently outperforms state-of-the-art plain Vision Transformers across both understanding and generation benchmarks. On ImageNet-1k classification, ViT-5-Base reaches 84.2% top-1 accuracy under comparable compute, exceeding DeiT-III-Base at 83.8%. ViT-5 also serves as a stronger backbone for generative modeling: when plugged into an SiT diffusion framework, it achieves 1.84 FID versus 2.06 with a…
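
To put the two headline comparisons in the abstract on a common footing, the gaps can be worked out with simple arithmetic. The sketch below is illustrative only: the helper names `absolute_gain` and `relative_reduction` are hypothetical, and the numbers are copied from the abstract as quoted here (the baseline behind the 2.06 FID is not named in this excerpt).

```python
def absolute_gain(new: float, old: float) -> float:
    """Difference in percentage points, rounded to two decimals."""
    return round(new - old, 2)


def relative_reduction(new: float, old: float) -> float:
    """Percentage by which `new` is lower than `old` (lower is better, e.g. FID)."""
    return round(100.0 * (old - new) / old, 2)


# Figures quoted in the abstract.
acc_gain = absolute_gain(84.2, 83.8)        # ViT-5-Base vs. DeiT-III-Base, top-1 accuracy
fid_drop = relative_reduction(1.84, 2.06)   # ViT-5 vs. the baseline, FID in SiT

print(f"Top-1 gain: {acc_gain} points")
print(f"Relative FID reduction: {fid_drop}%")
```

Under these figures, the accuracy gap is 0.4 percentage points and the FID is roughly 10.7% lower than the baseline value.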

Code & Models

Models

Hugging Face: FengWang3211/ViT-5

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning