Superpixel Tokenization for Vision Transformers: Preserving Semantic Integrity in Visual Tokens
Jaihyun Lew, Soohyuk Jang, Jaehoon Lee, Seungryong Yoo, Eunji Kim, Saehyung Lee, Jisoo Mok, Siwon Kim, Sungroh Yoon

TL;DR
This paper introduces superpixel tokenization for Vision Transformers, replacing grid-based patches with superpixels to better preserve semantic integrity and improve accuracy and robustness in visual tasks.
Contribution
It proposes a novel superpixel-based tokenization method that maintains semantic consistency, addressing limitations of traditional patch-based tokenization in ViT.
Findings
Enhanced accuracy on downstream tasks
Improved robustness against visual variations
Strong compatibility with existing ViT frameworks
Abstract
Transformers, a groundbreaking architecture proposed for Natural Language Processing (NLP), have also achieved remarkable success in Computer Vision. A cornerstone of their success lies in the attention mechanism, which models relationships among tokens. While the tokenization process in NLP inherently ensures that a single token does not contain multiple semantics, the tokenization of Vision Transformer (ViT) utilizes tokens from uniformly partitioned square image patches, which may result in an arbitrary mixing of visual concepts in a token. In this work, we propose to substitute the grid-based tokenization in ViT with superpixel tokenization, which employs superpixels to generate a token that encapsulates a sole visual concept. Unfortunately, the diverse shapes, sizes, and locations of superpixels make integrating superpixels into ViT tokenization rather challenging. Our tokenization…
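The contrast the abstract draws between grid patches and superpixels can be illustrated with a toy example. The sketch below is not the paper's tokenization pipeline, which the truncated abstract does not fully describe. It assumes a superpixel label map is already available (for instance from an off-the-shelf algorithm such as SLIC) and uses plain average pooling within each superpixel as a stand-in aggregation. The function names `grid_tokens` and `superpixel_tokens` are hypothetical.

```python
import numpy as np

def grid_tokens(image, patch):
    """Split an (H, W, C) image into non-overlapping patch x patch tokens."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)           # (rows, cols, patch, patch, C)
    return x.reshape(-1, patch * patch * c)  # one flattened token per patch

def superpixel_tokens(features, labels):
    """Average per-pixel features (H, W, D) inside each superpixel.

    Assumption: `labels` is an (H, W) integer map with ids 0..N-1.
    Average pooling is an illustrative choice, not the paper's method.
    """
    d = features.shape[-1]
    flat_feat = features.reshape(-1, d)
    flat_lab = labels.reshape(-1)
    n = int(flat_lab.max()) + 1
    sums = np.zeros((n, d))
    np.add.at(sums, flat_lab, flat_feat)     # scatter-add features per label
    counts = np.bincount(flat_lab, minlength=n)[:, None]
    return sums / np.maximum(counts, 1)

# Toy 4x4 single-channel image: column 0 is "concept A" (0), the rest "concept B" (1).
image = np.zeros((4, 4, 1))
image[:, 1:] = 1.0

# Hypothetical superpixel map that follows the concept boundary.
labels = np.zeros((4, 4), dtype=int)
labels[:, 1:] = 1

patches = grid_tokens(image, patch=2)        # shape (4, 4): four 2x2 patches
regions = superpixel_tokens(image, labels)   # shape (2, 1): one token per superpixel
```

In this toy case the first grid patch straddles the boundary and contains both values, while each superpixel token covers a single region. Note that the number of superpixel tokens and the pixels they cover vary from image to image, which is the integration difficulty the abstract points out.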
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Memory and Neural Computing · Cell Image Analysis Techniques
Methods: Adam · Position-Wise Feed-Forward Layer · Linear Layer · Softmax · Multi-Head Attention · Byte Pair Encoding · Label Smoothing · Dropout · Dense Connections · Layer Normalization
