TCFormer: Visual Recognition via Token Clustering Transformer

Wang Zeng; Sheng Jin; Lumin Xu; Wentao Liu; Chen Qian; Wanli Ouyang; Ping Luo; Xiaogang Wang

arXiv:2407.11321·cs.CV·July 17, 2024·1 cites

TL;DR

TCFormer generates vision tokens dynamically: it groups semantically similar image regions into shared tokens and assigns finer tokens to detail-rich areas, which improves performance across multiple vision tasks.

Contribution

The paper proposes a token clustering transformer that generates semantic-aware dynamic tokens, so that the token layout follows the content of the image rather than a fixed grid.

Findings

01 Improved accuracy in image classification

02 Enhanced performance in semantic segmentation

03 Effective across diverse vision applications

Abstract

Transformers are widely used in computer vision and have achieved remarkable success. Most state-of-the-art approaches split images into regular grids and represent each grid region with a vision token. However, a fixed token distribution disregards the semantic meaning of different image regions, resulting in sub-optimal performance. To address this issue, we propose the Token Clustering Transformer (TCFormer), which generates dynamic vision tokens based on semantic meaning. Our dynamic tokens possess two crucial characteristics: (1) representing image regions with similar semantic meanings using the same vision token, even if those regions are not adjacent, and (2) concentrating on regions with valuable details and representing them using fine tokens. Through extensive experimentation across various applications, including image classification, human pose estimation, semantic…
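
The first characteristic described in the abstract, one token standing for several similar regions that need not be adjacent, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which clustering algorithm TCFormer uses, so nearest-center assignment stands in for it here, and the names `assign_to_nearest`, `merge_tokens`, `features`, and `centers` are hypothetical. The sketch also leaves out the second characteristic (fine tokens for detail-rich regions).

```python
from typing import List, Sequence

Vector = Sequence[float]


def assign_to_nearest(features: Sequence[Vector], centers: Sequence[Vector]) -> List[int]:
    """Assign each grid token to the closest center (squared Euclidean distance).

    Assumption: a stand-in for the clustering step, which the abstract does not specify.
    """
    assignments = []
    for feat in features:
        distances = [sum((f - c) ** 2 for f, c in zip(feat, center)) for center in centers]
        assignments.append(distances.index(min(distances)))
    return assignments


def merge_tokens(features: Sequence[Vector], assignments: Sequence[int], num_clusters: int) -> List[List[float]]:
    """Average the features of all grid tokens that share a cluster index."""
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_clusters)]
    counts = [0] * num_clusters
    for feat, cluster in zip(features, assignments):
        counts[cluster] += 1
        for d in range(dim):
            sums[cluster][d] += feat[d]
    merged = []
    for cluster in range(num_clusters):
        if counts[cluster] == 0:
            raise ValueError(f"cluster {cluster} has no tokens")
        merged.append([s / counts[cluster] for s in sums[cluster]])
    return merged


# Four grid tokens in raster order; tokens 0 and 3 are similar but not adjacent.
features = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.5], [1.0, 0.5]]
centers = [[1.0, 0.0], [0.0, 1.0]]  # illustrative centers, chosen by hand

assignments = assign_to_nearest(features, centers)
merged = merge_tokens(features, assignments, num_clusters=len(centers))

print(assignments)  # [0, 1, 1, 0]
print(merged)       # [[1.0, 0.25], [0.0, 1.25]]
```

Tokens 0 and 3 end up in the same cluster although they sit at opposite corners of the grid, so one merged token represents both regions. Averaging is the simplest merge rule and is used here only for illustration.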

Code & Models

Repositories

zengwang430521/tcformer
PyTorch · Official

Taxonomy

Topics: Image Processing Techniques and Applications · Advanced Image and Video Retrieval Techniques · Image and Video Stabilization

Methods: Attention Is All You Need · Residual Connection · Byte Pair Encoding · Layer Normalization · Label Smoothing · Linear Layer · Adam · Dropout · Multi-Head Attention · Dense Connections