Visual-Word Tokenizer: Beyond Fixed Sets of Tokens in Vision Transformers

Leonidas Gee; Wing Yan Li; Viktoriia Sharmanska; Novi Quadrianto

arXiv:2411.15397·cs.CV·December 1, 2025

TL;DR

The paper introduces the Visual-Word Tokenizer (VWT), a training-free method that shortens the token sequence of a vision transformer by grouping frequently used visual subwords (image patches) into visual words. This reduces energy consumption during inference by up to 47% while retaining performance.

Contribution

It presents the Visual-Word Tokenizer (VWT), which uses intra-image or inter-image statistics to compress token sequences without additional training, improving the energy efficiency of vision transformers.

Findings

1. Energy consumption reduced by up to 47%
2. Outperforms quantization and token merging in energy efficiency
3. Suitable for real-time online inference with minimal performance impact

Abstract

The cost of deploying vision transformers increasingly represents a barrier to wider industrial adoption. Existing compression techniques require additional end-to-end fine-tuning or incur a significant drawback to energy efficiency, making them ill-suited for online (real-time) inference, where a prediction is made on any new input as it comes in. We introduce the Visual-Word Tokenizer (VWT), a training-free method for reducing energy costs while retaining performance. The VWT groups visual subwords (image patches) that are frequently used into visual words, while infrequent ones remain intact. To do so, intra-image or inter-image statistics are leveraged to identify similar visual concepts for sequence compression. Experimentally, we demonstrate a reduction in energy consumed of up to 47%. Comparative approaches of 8-bit quantization and token merging…
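
The grouping idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `compress_tokens`, the Euclidean distance, the `threshold` parameter, and the choice to average grouped patches are all assumptions made for illustration. It assumes a set of visual-word centroids is already available (for example, from clustering patches across many images), merges patches that fall close to the same visual word into one token, and leaves every other patch intact.

```python
import numpy as np


def compress_tokens(tokens, vocab, threshold):
    """Merge tokens that match the same visual word; keep the rest intact.

    Hypothetical helper, not taken from the paper or its repository.
    tokens:    (n, d) array of patch embeddings.
    vocab:     (k, d) array of visual-word centroids (assumed precomputed).
    threshold: maximum Euclidean distance for a token to match a word.
    """
    # Pairwise distances between tokens and visual words, shape (n, k).
    dists = np.linalg.norm(tokens[:, None, :] - vocab[None, :, :], axis=-1)
    nearest = dists.argmin(axis=1)
    matched = dists.min(axis=1) <= threshold

    # One averaged token per visual word that has at least one match.
    merged = [
        tokens[matched & (nearest == word)].mean(axis=0)
        for word in np.unique(nearest[matched])
    ]
    kept = tokens[~matched]  # tokens far from every visual word stay as-is
    if merged:
        return np.vstack([np.stack(merged), kept])
    return kept


# Toy example: two near-duplicate patches, one unmatched patch, one exact match.
tokens = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [10.0, 10.0]])
vocab = np.array([[0.0, 0.0], [10.0, 10.0]])
out = compress_tokens(tokens, vocab, threshold=0.5)
print(out.shape)  # (3, 2): four tokens compressed to three
```

In this toy case the first two patches collapse into a single token, so the sequence passed to the transformer is one token shorter. Fewer tokens per image is where the energy saving would come from under these assumptions.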

Code & Models

Repositories

wearepal/visual-word-tokenizer
PyTorch · Official

Taxonomy

Topics: Image Processing Techniques and Applications · Advanced Neural Network Applications · CCD and CMOS Imaging Sensors