Vision Transformer Off-the-Shelf: A Surprising Baseline for Few-Shot Class-Agnostic Counting
Zhicheng Wang, Liwen Xiao, Zhiguo Cao, Hao Lu

TL;DR
This paper introduces CACViT, a simple yet effective vision transformer-based approach for class-agnostic counting. It collapses the counting pipeline into a single pretrained plain ViT, adds scale and magnitude embeddings, and outperforms existing methods.
Contribution
The work demonstrates that class-agnostic counting can be effectively performed with a plain pretrained ViT using an extract-and-match approach, simplifying the existing pipeline.
Findings
CACViT reduces counting error by 23.60% relative to state-of-the-art CAC methods.
The approach generalizes well across datasets.
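Simple scale and magnitude embeddings compensate for the scale and order-of-magnitude information that a plain ViT loses through resizing and normalization.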
Abstract
Class-agnostic counting (CAC) aims to count objects of interest from a query image given few exemplars. This task is typically addressed by extracting the features of the query image and the exemplars separately and then matching their feature similarity, leading to an extract-then-match paradigm. In this work, we show that CAC can be simplified in an extract-and-match manner, particularly using a vision transformer (ViT) where feature extraction and similarity matching are executed simultaneously within the self-attention. We reveal the rationale of such simplification from a decoupled view of the self-attention. The resulting model, termed CACViT, simplifies the CAC pipeline into a single pretrained plain ViT. Further, to compensate for the loss of the scale and the order-of-magnitude information due to resizing and normalization in plain ViT, we present two effective strategies for scale and…
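The extract-and-match idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single attention head, random projection weights, and no positional, scale, or magnitude embeddings, and the names `joint_self_attention`, `a_qq`, `a_qe`, `a_eq`, and `a_ee` are hypothetical labels chosen here. The point is only that when query-image tokens and exemplar tokens are concatenated into one sequence, a single self-attention map splits into blocks, one of which relates query tokens to each other (feature extraction) and another of which relates query tokens to exemplar tokens (similarity matching).

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_self_attention(query_tokens, exemplar_tokens, w_q, w_k, w_v):
    """Single-head self-attention over concatenated query and exemplar tokens.

    Hypothetical helper for illustration; returns the attended tokens and
    the full attention map.
    """
    tokens = np.concatenate([query_tokens, exemplar_tokens], axis=0)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)
    return attn @ v, attn


# Toy sizes (assumptions): 16 query-image tokens, 4 exemplar tokens, width 8.
rng = np.random.default_rng(0)
n_query, n_exemplar, dim = 16, 4, 8
query_tokens = rng.normal(size=(n_query, dim))
exemplar_tokens = rng.normal(size=(n_exemplar, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

out, attn = joint_self_attention(query_tokens, exemplar_tokens, w_q, w_k, w_v)

# Decoupled view: slice the one attention map into four blocks.
a_qq = attn[:n_query, :n_query]  # query -> query: feature extraction on the image
a_qe = attn[:n_query, n_query:]  # query -> exemplar: similarity matching
a_eq = attn[n_query:, :n_query]  # exemplar -> query
a_ee = attn[n_query:, n_query:]  # exemplar -> exemplar

print(a_qq.shape, a_qe.shape, a_eq.shape, a_ee.shape)
```

Because each row of the attention map is normalized over both token groups at once, the attention mass a query token spends on extraction (`a_qq`) and on matching (`a_qe`) sums to one, which is one way to read "executed simultaneously" in the abstract.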
Taxonomy
Topics: Advanced Neural Network Applications · Video Surveillance and Tracking Methods · Dementia and Cognitive Impairment Research
Methods: Attention Is All You Need · Softmax · Linear Layer · Multi-Head Attention · Residual Connection · Dense Connections · Layer Normalization · Vision Transformer
