Feature Fusion Vision Transformer for Fine-Grained Visual Categorization
Jun Wang, Xiaohan Yu, Yongsheng Gao

TL;DR
This paper introduces FFVT, a pure transformer framework for fine-grained visual categorization that effectively fuses multi-level features using a novel token selection method, achieving state-of-the-art results.
Contribution
The paper proposes a new transformer-based model with a token selection module to incorporate local and low-level features for FGVC, surpassing previous CNN-based methods.
Findings
FFVT achieves state-of-the-art performance on three FGVC benchmarks.
The mutual attention weight selection (MAWS) effectively selects discriminative tokens.
The model enhances local feature representation without extra parameters.
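To make the token selection idea concrete, here is a minimal sketch of one plausible reading of mutual attention weight selection. The summary above only says that MAWS selects discriminative tokens, so every detail here is an assumption: the function name `mutual_attention_select`, the single-head setup, the choice of normalization, and the convention that index 0 is the classification token. The sketch scores each patch token by the product of two quantities: how strongly the classification token attends to that patch, and how strongly that patch attends to the classification token relative to other tokens. It then keeps the top-k patches. Selection of this kind uses only attention scores that the transformer already computes, which is consistent with the finding that no extra parameters are needed.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mutual_attention_select(scores, k):
    """Return the indices of the k patch tokens with the highest mutual attention.

    scores: square matrix (list of lists) of raw attention logits for one head,
            where scores[i][j] is the logit of token i attending to token j.
            Index 0 is assumed to be the classification token (an assumption
            of this sketch, not something stated in the summary).
    k:      number of patch tokens to keep.
    """
    n = len(scores)
    # Classification token -> each token: softmax over row 0.
    cls_to_token = softmax(scores[0])
    # Each token -> classification token: softmax over column 0,
    # i.e. normalized across tokens rather than across keys (assumed).
    token_to_cls = softmax([scores[i][0] for i in range(n)])
    # Mutual weight for every patch token (index 0 is skipped).
    mutual = {i: cls_to_token[i] * token_to_cls[i] for i in range(1, n)}
    # Highest mutual weight first; keep the top k indices.
    ranked = sorted(mutual, key=mutual.get, reverse=True)
    return ranked[:k]


# Toy example with one classification token and three patch tokens.
toy_scores = [
    [0.0, 2.0, 0.0, 1.0],
    [3.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]
selected = mutual_attention_select(toy_scores, k=2)
print(selected)
```

In the toy matrix, patch 1 has the largest logit in both directions and patch 3 the second largest, so those two are kept and patch 2 is dropped. A multi-head model would need some way to combine heads (for example averaging the scores), which this sketch leaves out.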
Abstract
The core for tackling the fine-grained visual categorization (FGVC) is to learn subtle yet discriminative features. Most previous works achieve this by explicitly selecting the discriminative parts or integrating the attention mechanism via CNN-based approaches. However, these methods enhance the computational complexity and make the model dominated by the regions containing the most of the objects. Recently, vision transformer (ViT) has achieved SOTA performance on general image recognition tasks. The self-attention mechanism aggregates and weights the information from all patches to the classification token, making it perfectly suitable for FGVC. Nonetheless, the classification token in the deep layer pays more attention to the global information, lacking the local and low-level features that are essential for FGVC. In this work, we propose a novel pure transformer-based framework…
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Dropout · Layer Normalization · Byte Pair Encoding · Vision Transformer
