Feature Fusion Vision Transformer for Fine-Grained Visual Categorization

Jun Wang; Xiaohan Yu; Yongsheng Gao

arXiv:2107.02341 · cs.CV · March 2, 2022 · 84 cites

TL;DR

This paper introduces FFVT, a pure transformer framework for fine-grained visual categorization that effectively fuses multi-level features using a novel token selection method, achieving state-of-the-art results.

Contribution

The paper proposes a new transformer-based model with a token selection module to incorporate local and low-level features for FGVC, surpassing previous CNN-based methods.

Findings

01. FFVT achieves state-of-the-art performance on three FGVC benchmarks.

02. The mutual attention weight selection (MAWS) effectively selects discriminative tokens.

03. The model enhances local feature representation without extra parameters.
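
The findings above name mutual attention weight selection (MAWS) but this page does not define it. The sketch below is one plausible reading, not the authors' implementation: it scores each patch token by the product of (a) how strongly the classification token attends to it and (b) how strongly it attends back to the classification token, then keeps the top-k tokens. The function name `select_tokens`, the input layout (a square matrix of raw attention logits with index 0 as the classification token), and the column-wise normalisation are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_tokens(scores, k):
    """Return indices of the k patch tokens with the highest mutual attention.

    scores: square list of lists of raw attention logits for one head,
            where row = query token, column = key token, and index 0 is
            the classification token (hypothetical layout).
    """
    n = len(scores)
    # a[i]: attention the classification token pays to token i (row 0).
    a = softmax(scores[0])
    # b[i]: attention token i pays to the classification token,
    # normalised down column 0 (an assumption of this sketch).
    b = softmax([scores[i][0] for i in range(n)])
    # Mutual weight is the product of the two directions.
    mutual = [a[i] * b[i] for i in range(n)]
    # Rank patch tokens only; the classification token itself is excluded.
    ranked = sorted(range(1, n), key=lambda i: mutual[i], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    toy_scores = [
        [0.0, 2.0, 0.1, 1.0],
        [1.5, 0.0, 0.0, 0.0],
        [0.1, 0.0, 0.0, 0.0],
        [2.0, 0.0, 0.0, 0.0],
    ]
    print(select_tokens(toy_scores, k=2))
```

Selection of this kind only reuses attention values that a transformer layer already computes, which is consistent with the claim that no extra parameters are needed. The official repository listed under Code & Models is the reference for the actual method.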

Abstract

The core of tackling fine-grained visual categorization (FGVC) is to learn subtle yet discriminative features. Most previous works achieve this by explicitly selecting the discriminative parts or integrating the attention mechanism via CNN-based approaches. However, these methods increase the computational complexity and make the model dominated by the regions containing most of the object. Recently, the vision transformer (ViT) has achieved SOTA performance on general image recognition tasks. The self-attention mechanism aggregates and weights the information from all patches to the classification token, making it perfectly suitable for FGVC. Nonetheless, the classification token in the deep layer pays more attention to the global information, lacking the local and low-level features that are essential for FGVC. In this work, we propose a novel pure transformer-based framework…

Code & Models

Repositories

Markin-Wang/FFVT (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Dropout · Layer Normalization · Byte Pair Encoding · Vision Transformer