Wavelet-Based Image Tokenizer for Vision Transformers
Zhenhai Zhu, Radu Soricut

TL;DR
This paper introduces a wavelet-based image tokenizer for Vision Transformers that improves training throughput and top-1 accuracy on ImageNet; the authors' analysis also suggests it handles high-resolution images well and is naturally resistant to adversarial attack.
Contribution
It proposes a novel wavelet transformation-based tokenizer that enhances ViT performance and throughput without changing the model architecture.
Findings
Higher training throughput with the new tokenizer
Improved top-1 accuracy on the ImageNet validation set
Theoretical analysis suggesting natural resistance to adversarial attacks
Abstract
Non-overlapping patch-wise convolution is the default image tokenizer for all state-of-the-art vision Transformer (ViT) models. Even though many ViT variants have been proposed to improve its efficiency and accuracy, little research on improving the image tokenizer itself has been reported in the literature. In this paper, we propose a new image tokenizer based on wavelet transformation. We show that ViT models with the new tokenizer achieve both higher training throughput and better top-1 precision for the ImageNet validation set. We present a theoretical analysis on why the proposed tokenizer improves the training throughput without any change to ViT model architecture. Our analysis suggests that the new tokenizer can effectively handle high-resolution images and is naturally resistant to adversarial attack. Furthermore, the proposed image tokenizer offers a fresh perspective on…
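To make the contrast in the abstract concrete, the sketch below places a plain non-overlapping patch split (the input to the default patch-wise tokenizer, before its linear projection) next to a single-level 2D Haar wavelet transform. It is only an illustration under stated assumptions: the abstract does not say which wavelet family, how many decomposition levels, or what post-processing the proposed tokenizer uses, so the Haar choice and the function names `patch_tokens` and `haar_dwt2` are hypothetical and not the authors' implementation.

```python
import numpy as np


def patch_tokens(image, p):
    """Split an (H, W) image into non-overlapping p x p patches.

    Returns an (N, p*p) array, one flattened patch per row. A default
    ViT tokenizer would follow this with a learned linear projection.
    Assumes H and W are divisible by p.
    """
    h, w = image.shape
    blocks = image.reshape(h // p, p, w // p, p).transpose(0, 2, 1, 3)
    return blocks.reshape(-1, p * p)


def haar_dwt2(image):
    """Single-level orthonormal 2D Haar transform of an (H, W) image.

    Assumes H and W are even. Returns four (H/2, W/2) subbands:
    a local average plus three detail bands. Haar is an assumption
    made for illustration, not a detail taken from the paper.
    """
    a = image[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = image[0::2, 1::2]  # top-right
    c = image[1::2, 0::2]  # bottom-left
    d = image[1::2, 1::2]  # bottom-right
    approx = (a + b + c + d) / 2.0       # scaled local average
    detail_cols = (a - b + c - d) / 2.0  # change across columns
    detail_rows = (a + b - c - d) / 2.0  # change across rows
    detail_diag = (a - b - c + d) / 2.0  # diagonal change
    return approx, detail_cols, detail_rows, detail_diag
```

In this sketch a smooth image region puts nearly all of its energy into the approximation band and leaves the detail bands close to zero, which is the general property that makes wavelet coefficients a compact input representation. How the paper exploits this is not stated in the abstract.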
Taxonomy
Topics: Image Processing and 3D Reconstruction · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques
Methods: Attention Is All You Need · Byte Pair Encoding · Label Smoothing · Adam · Position-Wise Feed-Forward Layer · Dropout · Dense Connections · Absolute Position Encodings · Softmax · Layer Normalization
