FTCFormer: Fuzzy Token Clustering Transformer for Image Classification

Muyi Bao; Changyu Zeng; Yifan Wang; Zhengni Yang; Zimu Wang; Guangliang Cheng; Jun Qi; Wei Wang

arXiv:2507.10283 · cs.CV · July 15, 2025

TL;DR

FTCFormer introduces a clustering-based token generation method for transformers that emphasizes semantic regions over spatial positions, improving image classification across diverse datasets.

Contribution

The paper proposes a novel clustering-based downsampling module, with associated mechanisms, that generates semantically meaningful tokens in transformer architectures.

Findings

01

Achieves consistent accuracy improvements across 32 datasets.

02

Improves accuracy by 1.43% on fine-grained datasets.

03

Enhances feature representation by focusing on semantic regions.

Abstract

Transformer-based deep neural networks have achieved remarkable success across various computer vision tasks, largely attributed to their long-range self-attention mechanism and scalability. However, most transformer architectures embed images into uniform, grid-based vision tokens, neglecting the underlying semantic meanings of image regions, resulting in suboptimal feature representations. To address this issue, we propose Fuzzy Token Clustering Transformer (FTCFormer), which incorporates a novel clustering-based downsampling module to dynamically generate vision tokens based on the semantic meanings instead of spatial positions. It allocates fewer tokens to less informative regions and more to represent semantically important regions, regardless of their spatial adjacency or shape irregularity. To further enhance feature extraction and representation, we propose a Density Peak…
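The idea of merging many grid tokens into fewer, semantically grouped tokens can be illustrated with a small sketch. What follows is not the paper's module: it uses textbook fuzzy c-means memberships as a stand-in, and the function names (`fuzzy_memberships`, `merge_tokens`), the fuzzifier `m = 2`, the toy 2-D tokens, and the fixed cluster centers are all assumptions made for illustration.

```python
import math

def fuzzy_memberships(tokens, centers, m=2.0, eps=1e-9):
    """Soft membership of each token in each cluster (each row sums to 1).

    Textbook fuzzy c-means rule, used here as an assumed stand-in:
    u_ik = 1 / sum_j (d_ik / d_ij) ** (2 / (m - 1))
    """
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for token in tokens:
        # eps avoids division by zero when a token sits exactly on a center
        dists = [math.dist(token, center) + eps for center in centers]
        row = [1.0 / sum((d_k / d_j) ** exponent for d_j in dists) for d_k in dists]
        memberships.append(row)
    return memberships

def merge_tokens(tokens, memberships, num_clusters, m=2.0):
    """Downsample N tokens to num_clusters tokens via membership-weighted means."""
    dim = len(tokens[0])
    merged = []
    for k in range(num_clusters):
        weights = [row[k] ** m for row in memberships]
        total = sum(weights)
        merged.append(
            [sum(w * t[d] for w, t in zip(weights, tokens)) / total for d in range(dim)]
        )
    return merged

# Hypothetical toy input: four 2-D "tokens" forming two loose groups.
tokens = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
centers = [[0.0, 0.0], [1.0, 1.0]]  # assumed fixed centers, not learned

memberships = fuzzy_memberships(tokens, centers)
merged = merge_tokens(tokens, memberships, num_clusters=2)
```

Here four input tokens are reduced to two, and each merged token is pulled toward the group it mostly represents. Because memberships are soft, a token near a boundary contributes to more than one merged token, whereas a hard assignment would force it into exactly one.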
