Siamese Transformer Networks for Few-shot Image Classification
Weihao Jiang, Shuoxi Zhang, Kun He

TL;DR
This paper introduces a Siamese Transformer Network that combines global and local features using pre-trained Vision Transformers for improved few-shot image classification, demonstrating superior results on multiple benchmarks.
Contribution
The paper proposes a Siamese Transformer Network (STN) architecture that integrates global and local features and is trained with a meta-learning strategy, improving few-shot classification performance.
Findings
Achieves superior accuracy on four benchmarks.
Effectively combines global and local features.
Outperforms state-of-the-art methods in 5-shot and 1-shot scenarios.
Abstract
Humans exhibit remarkable proficiency in visual classification tasks, accurately recognizing and classifying new images with minimal examples. This ability is attributed to their capacity to focus on details and identify common features between previously seen and new images. In contrast, existing few-shot image classification methods often emphasize either global features or local features, with few studies considering the integration of both. To address this limitation, we propose a novel approach based on the Siamese Transformer Network (STN). Our method employs two parallel branch networks utilizing the pre-trained Vision Transformer (ViT) architecture to extract global and local features, respectively. Specifically, we implement the ViT-Small network architecture and initialize the branch networks with pre-trained model parameters obtained through self-supervised learning. We apply…
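The abstract is truncated before it explains how the two branches are scored and merged, so the following is only a minimal sketch of the general idea of combining a global and a local similarity, not the authors' method. It assumes each image has already been encoded into one class-token vector (global) and a set of patch-token vectors (local). The function names, the best-match patch scoring, the `weight` and `temperature` parameters, and the weighted-sum fusion are all hypothetical choices made for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so that dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def global_scores(query_cls, class_cls):
    """Cosine similarity between query class tokens (Q, D) and per-class
    reference class tokens (C, D). Returns an array of shape (Q, C)."""
    return l2_normalize(query_cls) @ l2_normalize(class_cls).T

def local_scores(query_patches, class_patches):
    """Patch-level similarity (an assumed scoring rule, not taken from the paper).
    query_patches: (Q, P, D), class_patches: (C, P, D).
    Each query patch is matched to its most similar reference patch and the
    best-match similarities are averaged. Returns an array of shape (Q, C)."""
    q = l2_normalize(query_patches)
    c = l2_normalize(class_patches)
    sim = np.einsum("qid,cjd->qcij", q, c)  # (Q, C, P, P) pairwise cosine
    return sim.max(axis=-1).mean(axis=-1)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def combine_and_predict(g, l, weight=0.5, temperature=0.1):
    """Fuse the two (Q, C) score matrices with a weighted sum of softmax
    probabilities (hypothetical fusion rule) and return predicted class
    indices together with the fused probabilities."""
    probs = weight * softmax(g / temperature) + (1.0 - weight) * softmax(l / temperature)
    return probs.argmax(axis=1), probs
```

In a 5-way episode, `class_cls` would hold one reference vector per class (for example, the mean of that class's support embeddings, which is itself an assumption here), and `combine_and_predict` would assign each query image to the class with the highest fused probability.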
Taxonomy
Topics: Image Processing Techniques and Applications · Spectroscopy Techniques in Biomedical and Chemical Research
Methods: Attention Is All You Need · Linear Layer · Residual Connection · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Softmax · Absolute Position Encodings · Vision Transformer
