Deeper Inside Deep ViT

Sungrae Hong

arXiv:2508.04181·cs.CV·August 7, 2025

TL;DR

This paper investigates the training stability, practical utility, and image generation capability of the large-scale ViT-22B Vision Transformer architecture when trained in a local environment, and describes modifications that improve its training stability.

Contribution

It introduces model modifications to stabilize training of ViT-22B, compares ViT and ViT-22B for image generation, and evaluates their performance and utility.

Findings

01

Trained from scratch, the ViT-22B architecture overall outperforms ViT at the same parameter size.

02

Training instability issues are identified and addressed.

03

ViT-22B shows potential for image generation tasks.

Abstract

There have been attempts to build large-scale structures in vision models similar to those of LLMs, such as ViT-22B. While that research has provided numerous analyses and insights, our understanding of its practical utility remains incomplete. Therefore, we examine how this model structure behaves and trains in a local environment. We also highlight instability in training and make some model modifications to stabilize it. Trained from scratch, the ViT-22B model overall outperformed ViT at the same parameter size. Additionally, we venture into the task of image generation, which has not previously been attempted with ViT-22B. We propose an image generation architecture using ViT and investigate which of ViT and ViT-22B is the more suitable structure for image generation.
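
The abstract does not say which modifications the author made to stabilize training, so the following is background only, not the paper's method. The original ViT-22B design is known for normalizing queries and keys with LayerNorm before the attention dot product, which keeps attention logits from growing without bound. A minimal pure-Python sketch of that idea, assuming a single head and omitting the learned scale parameters; the function names `layer_norm`, `attention_logits`, and `softmax` are illustrative and not taken from the paper:

```python
import math


def layer_norm(vec, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned scale or bias)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def attention_logits(queries, keys, qk_norm):
    """Scaled dot-product logits for one head, optionally with QK normalization."""
    dim = len(queries[0])
    if qk_norm:
        # Assumption: normalization is applied per token, as in QK-LayerNorm.
        queries = [layer_norm(q) for q in queries]
        keys = [layer_norm(k) for k in keys]
    scale = 1.0 / math.sqrt(dim)
    return [
        [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        for q in queries
    ]


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical activations: the first token has drifted to a large magnitude.
queries = [[100.0, -50.0, 25.0, 0.0], [3.0, 1.0, -2.0, 4.0]]
keys = [[100.0, -50.0, 25.0, 0.0], [1.0, 2.0, 3.0, 4.0]]

raw = attention_logits(queries, keys, qk_norm=False)
normed = attention_logits(queries, keys, qk_norm=True)
weights = [softmax(row) for row in normed]
```

Without normalization, the logit between the two large-magnitude vectors is in the thousands, so the softmax collapses to a near one-hot distribution. With normalization, each logit is bounded in magnitude by the square root of the head dimension, whatever the scale of the inputs.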
