Visual-Word Tokenizer: Beyond Fixed Sets of Tokens in Vision Transformers
Leonidas Gee, Wing Yan Li, Viktoriia Sharmanska, Novi Quadrianto

TL;DR
The paper introduces the Visual-Word Tokenizer, a training-free method that compresses vision transformer tokens by grouping similar visual subwords, significantly reducing energy consumption during inference with minimal performance loss.
Contribution
The paper presents the VWT, a technique that uses intra-image or inter-image statistics to compress token sequences without additional training, improving the energy efficiency of vision transformers.
Findings
Energy consumption reduced by up to 47%
Outperforms quantization and token merging in energy efficiency
Suitable for real-time online inference with minimal performance impact
Abstract
The cost of deploying vision transformers increasingly represents a barrier to wider industrial adoption. Existing compression techniques require additional end-to-end fine-tuning or incur a significant drawback to energy efficiency, making them ill-suited for online (real-time) inference, where a prediction is made on any new input as it comes in. We introduce the Visual-Word Tokenizer (VWT), a training-free method for reducing energy costs while retaining performance. The VWT groups visual subwords (image patches) that are frequently used into visual words, while infrequent ones remain intact. To do so, intra-image or inter-image statistics are leveraged to identify similar visual concepts for sequence compression. Experimentally, we demonstrate a reduction in energy consumed of up to 47%. Comparative approaches of 8-bit quantization and token merging…
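The grouping idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a precomputed codebook of visual words (for example, cluster centres learned from other images), a Euclidean distance threshold, and mean pooling for merged tokens. The names `compress_tokens`, `codebook`, and `max_dist` are hypothetical and chosen only for illustration.

```python
import numpy as np

def compress_tokens(tokens, codebook, max_dist):
    """Merge tokens that lie close to the same codebook entry; keep the rest.

    tokens:   (N, D) array of patch embeddings (visual subwords).
    codebook: (K, D) array of visual words (assumed to be precomputed).
    max_dist: hypothetical threshold; tokens farther than this from every
              visual word are treated as infrequent and left intact.
    """
    # Pairwise Euclidean distances between every token and every visual word.
    dists = np.linalg.norm(tokens[:, None, :] - codebook[None, :, :], axis=-1)
    nearest = dists.argmin(axis=1)
    groupable = dists.min(axis=1) <= max_dist

    # Tokens that match no visual word are kept unchanged.
    kept = [tokens[~groupable]]
    # Tokens sharing a visual word are averaged into a single token.
    for word in np.unique(nearest[groupable]):
        members = tokens[groupable & (nearest == word)]
        kept.append(members.mean(axis=0, keepdims=True))
    return np.concatenate(kept, axis=0)

tokens = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.0], [9.0, -9.0]])
codebook = np.array([[0.0, 0.0], [5.0, 5.0]])
compressed = compress_tokens(tokens, codebook, max_dist=0.5)
print(compressed.shape)
```

In this toy input, the first two tokens share a visual word and are merged, so four tokens shrink to three. The sketch ignores positional information and the ordering of the output sequence, both of which a real vision transformer pipeline would need to handle.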
Taxonomy
Topics: Image Processing Techniques and Applications · Advanced Neural Network Applications · CCD and CMOS Imaging Sensors
