Vision Transformer Off-the-Shelf: A Surprising Baseline for Few-Shot Class-Agnostic Counting

Zhicheng Wang; Liwen Xiao; Zhiguo Cao; Hao Lu

arXiv:2305.04440 · cs.CV · March 5, 2024 · 1 citation

TL;DR

This paper introduces CACViT, a simple vision transformer-based approach to class-agnostic counting. It collapses the counting pipeline into a single pretrained plain ViT, adds scale and magnitude embeddings, and outperforms existing methods.

Contribution

The work demonstrates that class-agnostic counting can be performed effectively with a plain pretrained ViT in an extract-and-match manner, replacing the usual extract-then-match pipeline in which feature extraction and similarity matching are separate steps.

Findings

01

CACViT outperforms state-of-the-art methods with 23.60% error reduction.

02

The approach generalizes well across datasets.

03

Simple embeddings improve scale and magnitude handling.

Abstract

Class-agnostic counting (CAC) aims to count objects of interest in a query image given a few exemplars. This task is typically addressed by extracting the features of the query image and the exemplars separately and then matching their feature similarity, leading to an extract-then-match paradigm. In this work, we show that CAC can be simplified in an extract-and-match manner, particularly using a vision transformer (ViT) where feature extraction and similarity matching are executed simultaneously within the self-attention. We reveal the rationale of such simplification from a decoupled view of the self-attention. The resulting model, termed CACViT, simplifies the CAC pipeline into a single pretrained plain ViT. Further, to compensate for the loss of the scale and the order-of-magnitude information due to resizing and normalization in a plain ViT, we present two effective strategies for scale and…
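One way to read the "decoupled view" mentioned in the abstract is sketched below. It is a minimal NumPy illustration, not the CACViT implementation. Query-image tokens and exemplar tokens are concatenated and passed through a single self-attention layer, and the resulting attention matrix is split into four blocks. The function name `joint_self_attention`, the block names, the token counts, and the random weights are all assumptions made for illustration. In this reading, the query-to-query block plays the role of feature extraction, and the query-to-exemplar block plays the role of similarity matching.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_self_attention(query_tokens, exemplar_tokens, w_q, w_k, w_v):
    """Single-head self-attention over concatenated query and exemplar tokens.

    Hypothetical helper for illustration only; shapes are
    query_tokens (n_q, d), exemplar_tokens (n_e, d), weights (d, d).
    """
    tokens = np.concatenate([query_tokens, exemplar_tokens], axis=0)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)
    out = attn @ v

    n_q = query_tokens.shape[0]
    # Decoupled view: slice the joint attention map into four blocks.
    blocks = {
        "query_to_query": attn[:n_q, :n_q],        # image attends to itself
        "query_to_exemplar": attn[:n_q, n_q:],     # image attends to exemplars
        "exemplar_to_query": attn[n_q:, :n_q],     # exemplars attend to image
        "exemplar_to_exemplar": attn[n_q:, n_q:],  # exemplars attend to themselves
    }
    return out, attn, blocks


# Toy inputs: 9 image patches, 4 exemplar patches, 16-dim embeddings (assumed sizes).
rng = np.random.default_rng(0)
d = 16
query_tokens = rng.normal(size=(9, d))
exemplar_tokens = rng.normal(size=(4, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

out, attn, blocks = joint_self_attention(query_tokens, exemplar_tokens, w_q, w_k, w_v)
for name, block in blocks.items():
    print(name, block.shape)
```

Because the softmax is taken over the full row, each image token splits its attention mass between the query-to-query and query-to-exemplar blocks, so extraction and matching happen in the same operation rather than in two stages.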

Code & Models

Repositories

Xu3XiWang/CACViT
PyTorch · Official

Videos

Vision Transformer Off-the-Shelf: A Surprising Baseline for Few-Shot Class-Agnostic Counting

Taxonomy

Topics: Advanced Neural Network Applications · Video Surveillance and Tracking Methods · Dementia and Cognitive Impairment Research

Methods: Attention Is All You Need · Softmax · Linear Layer · Multi-Head Attention · Residual Connection · Dense Connections · Layer Normalization · Vision Transformer