Superpixel Tokenization for Vision Transformers: Preserving Semantic Integrity in Visual Tokens

Jaihyun Lew; Soohyuk Jang; Jaehoon Lee; Seungryong Yoo; Eunji Kim; Saehyung Lee; Jisoo Mok; Siwon Kim; Sungroh Yoon

arXiv:2412.04680·cs.CV·March 26, 2025

Open Access · 1 Repo

TL;DR

This paper introduces superpixel tokenization for Vision Transformers, replacing grid-based patches with superpixels to better preserve semantic integrity and improve accuracy and robustness in visual tasks.

Contribution

The paper proposes a superpixel-based tokenization method that keeps each token semantically consistent, addressing the mixing of visual concepts that grid-based patch tokenization in ViT can cause.

Findings

01 Enhanced accuracy on downstream tasks
02 Improved robustness against visual variations
03 Strong compatibility with existing ViT frameworks

Abstract

Transformers, a groundbreaking architecture proposed for Natural Language Processing (NLP), have also achieved remarkable success in Computer Vision. A cornerstone of their success lies in the attention mechanism, which models relationships among tokens. While the tokenization process in NLP inherently ensures that a single token does not contain multiple semantics, the tokenization of Vision Transformer (ViT) utilizes tokens from uniformly partitioned square image patches, which may result in an arbitrary mixing of visual concepts in a token. In this work, we propose to substitute the grid-based tokenization in ViT with superpixel tokenization, which employs superpixels to generate a token that encapsulates a sole visual concept. Unfortunately, the diverse shapes, sizes, and locations of superpixels make integrating superpixels into ViT tokenization rather challenging. Our tokenization…
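
The contrast the abstract draws between grid patches and superpixels can be illustrated with a toy example. The sketch below is not the authors' pipeline (the abstract is truncated before it describes their method): it only counts how many square patches straddle two concepts, then mean-pools hypothetical per-pixel features over an idealized superpixel map that is assumed to coincide with the concept regions. The function names, the 4x4 concept map, and the feature values are all assumptions made for illustration.

```python
from collections import defaultdict


def grid_patch_concepts(concept_map, patch):
    """Return the set of concept ids inside each patch x patch square.

    Assumes the map height and width are multiples of `patch`.
    Patches are visited in row-major order, as in ViT-style patchification.
    """
    height, width = len(concept_map), len(concept_map[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append({
                concept_map[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            })
    return patches


def superpixel_mean_tokens(features, segments):
    """Average per-pixel feature vectors over each superpixel id.

    Illustrative pooling only; not the aggregation used in the paper.
    """
    sums = {}
    counts = defaultdict(int)
    for feat_row, seg_row in zip(features, segments):
        for feat, seg in zip(feat_row, seg_row):
            running = sums.setdefault(seg, [0.0] * len(feat))
            for i, value in enumerate(feat):
                running[i] += value
            counts[seg] += 1
    return {seg: [s / counts[seg] for s in total] for seg, total in sums.items()}


# Toy 4x4 image: concept 0 and concept 1 separated by a diagonal boundary.
concept_map = [
    [0, 0, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 1],
]
# Hypothetical 1-D pixel features: 0.25 for concept 0, 0.75 for concept 1.
features = [[[0.25 + 0.5 * c] for c in row] for row in concept_map]

patches = grid_patch_concepts(concept_map, patch=2)
mixed = sum(len(p) > 1 for p in patches)

# Assume an ideal superpixel map that coincides with the concept regions.
tokens = superpixel_mean_tokens(features, segments=concept_map)

print(f"{mixed} of {len(patches)} grid patches mix concepts")
print(tokens)
```

On this toy input, two of the four 2x2 patches contain both concepts, while each pooled superpixel token summarizes a single region. In practice the superpixel map would come from a segmentation algorithm rather than from ground-truth concepts, so real superpixels would only approximate this ideal.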

Code & Models

Repositories

jangsoohyuk/SuiT (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Memory and Neural Computing · Cell Image Analysis Techniques

Methods: Adam · Position-Wise Feed-Forward Layer · Linear Layer · Softmax · Multi-Head Attention · Byte Pair Encoding · Label Smoothing · Dropout · Dense Connections · Layer Normalization