Wavelet-Based Image Tokenizer for Vision Transformers

Zhenhai Zhu; Radu Soricut

arXiv:2405.18616 · cs.CV · May 30, 2024 · 1 citation

Open Access

TL;DR

This paper introduces a wavelet-based image tokenizer for Vision Transformers, improving training efficiency, accuracy, and robustness, and opening new research directions in image tokenization methods.

Contribution

The paper proposes a novel image tokenizer based on wavelet transformation that improves ViT accuracy and training throughput without changing the model architecture.

Findings

01. Higher training throughput with the new tokenizer

02. Improved top-1 accuracy on the ImageNet validation set

03. Enhanced resistance to adversarial attacks

Abstract

Non-overlapping patch-wise convolution is the default image tokenizer for all state-of-the-art vision Transformer (ViT) models. Even though many ViT variants have been proposed to improve its efficiency and accuracy, little research on improving the image tokenizer itself has been reported in the literature. In this paper, we propose a new image tokenizer based on wavelet transformation. We show that ViT models with the new tokenizer achieve both higher training throughput and better top-1 precision for the ImageNet validation set. We present a theoretical analysis on why the proposed tokenizer improves the training throughput without any change to ViT model architecture. Our analysis suggests that the new tokenizer can effectively handle high-resolution images and is naturally resistant to adversarial attack. Furthermore, the proposed image tokenizer offers a fresh perspective on…
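To make the idea of a wavelet-based tokenizer concrete, here is a minimal sketch in plain Python. It is not the authors' tokenizer: the abstract does not say which wavelet, how many decomposition levels, or what post-processing the paper uses, so this example assumes a single-level orthonormal 2D Haar transform on a grayscale image and forms one token per 2×2 block from its four coefficients. The function names `haar2d` and `wavelet_tokens` are hypothetical.

```python
def haar2d(image):
    """One level of the orthonormal 2D Haar transform (illustrative assumption).

    `image` is a list of rows of numbers with even height and width.
    Returns four subbands, each half the height and width of the input:
    approximation, left/right difference, top/bottom difference, diagonal.
    """
    height, width = len(image), len(image[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")

    approx, col_diff, row_diff, diag = [], [], [], []
    for i in range(0, height, 2):
        a_row, c_row, r_row, d_row = [], [], [], []
        for j in range(0, width, 2):
            # The four pixels of one non-overlapping 2x2 block.
            a, b = image[i][j], image[i][j + 1]
            c, d = image[i + 1][j], image[i + 1][j + 1]
            a_row.append((a + b + c + d) / 2)  # scaled local average
            c_row.append((a - b + c - d) / 2)  # left column minus right column
            r_row.append((a + b - c - d) / 2)  # top row minus bottom row
            d_row.append((a - b - c + d) / 2)  # diagonal difference
        approx.append(a_row)
        col_diff.append(c_row)
        row_diff.append(r_row)
        diag.append(d_row)
    return approx, col_diff, row_diff, diag


def wavelet_tokens(image):
    """Stack the four subband coefficients at each location into one token."""
    approx, col_diff, row_diff, diag = haar2d(image)
    return [
        [approx[i][j], col_diff[i][j], row_diff[i][j], diag[i][j]]
        for i in range(len(approx))
        for j in range(len(approx[0]))
    ]
```

In this sketch each token is an orthogonal change of basis of the pixels in one 2×2 block, so the total energy (sum of squares) of the image is preserved. For a constant block the three difference coefficients are zero, which illustrates why wavelet coefficients of smooth image regions tend to be sparse. A real tokenizer would still need a learned projection to the model's embedding dimension, which is omitted here.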

Taxonomy

Topics: Image Processing and 3D Reconstruction · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques

Methods: Attention Is All You Need · Byte Pair Encoding · Label Smoothing · Adam · Position-Wise Feed-Forward Layer · Dropout · Dense Connections · Absolute Position Encodings · Softmax · Layer Normalization