Token Transformer: Can class token help window-based transformer build better long-range interactions?
Jiawei Mao, Yuanqi Chang, Xuesong Yin

TL;DR
The paper introduces Token Transformer (TT), which enhances window-based transformers with class tokens for improved long-range interactions, achieving competitive results efficiently.
Contribution
The novel CLS Attention mechanism and Feature Inheritance Module enable better long-range modeling while maintaining hierarchical structure.
Findings
TT achieves competitive accuracy with a low parameter count.
CLS tokens improve long-range interaction in window-based transformers.
TT performs well on image classification and downstream tasks.
Abstract
Compared with the vanilla transformer, the window-based transformer offers a better trade-off between accuracy and efficiency. Although the window-based transformer has made great progress, its long-range modeling capabilities are limited by the size of the local window and the window connection scheme. To address this problem, we propose a novel Token Transformer (TT). The core mechanism of TT is the addition of a Class (CLS) token that summarizes window information in each local window. We refer to this type of token interaction as CLS Attention. These CLS tokens interact spatially with the tokens in each window to enable long-range modeling. To preserve the hierarchical design of the window-based transformer, we design a Feature Inheritance Module (FIM) in each phase of TT to deliver the local window information from the previous phase to the CLS token in the next…
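The abstract describes the mechanism only at a high level, so the following is a minimal NumPy sketch of one way per-window CLS tokens could carry information across windows, not the authors' implementation. Everything specific here is an assumption made for illustration: the function names (`window_partition`, `attend`, `cls_attention_sketch`), initializing each CLS token as the window mean, the two-step ordering, and the use of plain dot-product attention without learned projections or multiple heads.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def window_partition(x, w):
    """(H, W, C) -> (num_windows, w*w, C); H and W must be divisible by w."""
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, w * w, C)

def attend(q, k, v):
    """Scaled dot-product attention with no learned projections (assumption)."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    return softmax(q @ np.swapaxes(k, -1, -2) * scale) @ v

def cls_attention_sketch(x, w):
    """Hypothetical illustration of CLS tokens linking local windows."""
    windows = window_partition(x, w)                 # (nW, w*w, C)
    nW, n, C = windows.shape

    # Step 1 (assumed): one CLS token per window summarizes that window.
    cls_init = windows.mean(axis=1, keepdims=True)   # (nW, 1, C)
    cls = attend(cls_init, windows, windows)         # (nW, 1, C)

    # Step 2 (assumed): each window's tokens attend to their own window
    # plus the CLS tokens of every window, which is where information
    # from distant windows can enter.
    all_cls = np.broadcast_to(cls.reshape(1, nW, C), (nW, nW, C))
    keys = np.concatenate([windows, all_cls], axis=1)  # (nW, n + nW, C)
    out = attend(windows, keys, keys)                  # (nW, n, C)
    return out, cls[:, 0, :]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((8, 8, 16))   # toy 8x8 feature map, 16 channels
    out, cls = cls_attention_sketch(feats, w=4)
    print(out.shape, cls.shape)
```

In this sketch the extra cost over plain window attention is one additional key per window for every query, so the attention width grows from w*w to w*w + nW rather than to the full H*W of global attention. The Feature Inheritance Module is not sketched because the abstract shown here is truncated before it is fully described.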
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Brain Tumor Detection and Classification
Methods: Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Label Smoothing · Layer Normalization · Linear Layer · Softmax · Adam · Absolute Position Encodings · Byte Pair Encoding
