Scaling White-Box Transformers for Vision

Jinrui Yang; Xianhang Li; Druv Pai; Yuyin Zhou; Yi Ma; Yaodong Yu; Cihang Xie

arXiv:2405.20299 · cs.CV · January 15, 2025 · 2 cites

Open Access

TL;DR

This paper introduces CRATE-$\alpha$, an improved scalable white-box transformer architecture for vision tasks, achieving higher accuracy on ImageNet while maintaining interpretability, through minimal modifications and a light training recipe.

Contribution

The paper presents CRATE-$\alpha$, a scalable version of the CRATE architecture with enhanced performance and interpretability for vision transformers, addressing previous scalability limitations.

Findings

01 CRATE-$\alpha$-B achieves 83.2% ImageNet accuracy, surpassing the prior best CRATE-B model by 3.7%.

02 CRATE-$\alpha$-L reaches 85.1% accuracy on ImageNet.

03 Interpretability is preserved, and improves, as CRATE-$\alpha$ models grow larger.

Abstract

CRATE, a white-box transformer architecture designed to learn compressed and sparse representations, offers an intriguing alternative to standard vision transformers (ViTs) due to its inherent mathematical interpretability. Despite extensive investigations into the scaling behaviors of language and vision transformers, the scalability of CRATE remains an open question which this paper aims to address. Specifically, we propose CRATE-$\alpha$, featuring strategic yet minimal modifications to the sparse coding block in the CRATE architecture design, and a light training recipe designed to improve the scalability of CRATE. Through extensive experiments, we demonstrate that CRATE-$\alpha$ can effectively scale with larger model sizes and datasets. For example, our CRATE-$\alpha$-B substantially outperforms the prior best CRATE-B model accuracy on ImageNet classification by 3.7%, achieving an…

Taxonomy

Topics: Optical measurement and interference techniques · Industrial Vision Systems and Defect Detection · Infrared Target Detection Methodologies