Improved Image Classification with Token Fusion
Keong Hun Choi, Jin Woo Kim, Yao Wang, Jong Eun Ha

TL;DR
This paper introduces a novel image classification approach that fuses CNN and transformer features through three different token fusion methods, achieving superior performance on ImageNet 1k.
Contribution
It presents three new token fusion techniques combining CNN and transformer features for improved image classification.
Findings
Achieved state-of-the-art accuracy on ImageNet 1k
Demonstrated effectiveness of multi-level token fusion methods
Compared fusion strategies and identified the most effective approach
Abstract
In this paper, we propose a method that fuses CNN and transformer structures to improve image classification performance. A CNN extracts information about local image regions well, but it is limited in extracting global information. A transformer, on the other hand, is better at extracting relatively global information, but it requires a lot of memory to extract local features. In our method, the image is converted into a feature map by a CNN, and each pixel of the feature map is treated as a token. At the same time, the image is divided into patches, which are treated as tokens in the transformer manner, and the two sets of tokens are fused. To fuse tokens with these two different characteristics, we propose three methods: (1) late token fusion with a parallel structure, (2) early token fusion, (3) token fusion in a layer by…
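The two tokenizations described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the array sizes, the random projection standing in for a learned embedding, and the random array standing in for a CNN feature map are all assumptions made for illustration. The final concatenation is only one generic way to combine two token sets and is not claimed to match any of the three fusion methods proposed in the paper.

```python
import numpy as np


def patch_tokens(image, patch):
    """Split an (H, W, C) image into non-overlapping patches, one flat token per patch."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image[:gh * patch, :gw * patch]          # drop any ragged border
    x = x.reshape(gh, patch, gw, patch, c)       # split rows and columns into grid x patch
    x = x.transpose(0, 2, 1, 3, 4)               # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)


def feature_map_tokens(fmap):
    """Treat each spatial position of an (h, w, d) feature map as one d-dimensional token."""
    h, w, d = fmap.shape
    return fmap.reshape(h * w, d)


def project(tokens, out_dim, rng):
    """Hypothetical random linear map standing in for a learned token embedding."""
    in_dim = tokens.shape[1]
    weight = rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)
    return tokens @ weight


rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))         # toy input image
fmap = rng.standard_normal((8, 8, 16))           # stand-in for a CNN feature map (no real CNN here)

dim = 24                                         # assumed shared token width
p_tok = project(patch_tokens(image, patch=8), dim, rng)   # 16 patch tokens
c_tok = project(feature_map_tokens(fmap), dim, rng)       # 64 feature-map tokens

# Assumption: combine both token sets along the sequence axis so a
# transformer could attend over CNN-derived and patch-derived tokens together.
fused = np.concatenate([c_tok, p_tok], axis=0)
print(p_tok.shape, c_tok.shape, fused.shape)
```

Both token sets are projected to a common width before being combined because patch tokens and feature-map tokens generally have different raw dimensions (here 8 × 8 × 3 = 192 versus 16).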
Taxonomy
Topics: Image Processing and 3D Reconstruction
