VLCounter: Text-aware Visual Representation for Zero-Shot Object Counting

Seunggu Kang; WonJun Moon; Euiyeon Kim; Jae-Pil Heo

arXiv:2312.16580 · cs.CV · January 2, 2024 · 1 citation

TL;DR

VLCounter introduces a one-stage, text-aware visual counting model that leverages CLIP embeddings and novel modules to improve zero-shot object counting accuracy and generalization without relying on error-prone two-stage pipelines.

Contribution

The paper proposes VLCounter, an end-to-end framework for zero-shot object counting built around three novel modules; it outperforms existing methods and reduces the error propagation of two-stage pipelines.

Findings

01

VLCounter achieves superior performance on FSC147, CARPK, and PUCPR+ datasets.

02

The proposed modules enhance target localization and counting accuracy for unseen classes.

03

End-to-end training improves robustness and generalization in zero-shot counting tasks.

Abstract

Zero-Shot Object Counting (ZSOC) aims to count referred instances of arbitrary classes in a query image without human-annotated exemplars. To deal with ZSOC, preceding studies proposed a two-stage pipeline: discovering exemplars and counting. However, the sequentially designed two-stage process remains vulnerable to error propagation. In this work, a one-stage baseline, Visual-Language Baseline (VLBase), which explores the implicit association of the semantic-patch embeddings of CLIP, is proposed. Subsequently, VLBase is extended to Visual-language Counter (VLCounter) by incorporating three modules devised to tailor VLBase for object counting. First, Semantic-conditioned Prompt Tuning (SPT) is introduced within the image encoder to acquire target-highlighted representations. Second, Learnable Affine Transformation (LAT) is employed to translate the…
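The abstract does not spell out how the association between semantic and patch embeddings is computed. One plausible reading is a similarity map: the cosine similarity between a class-text embedding and every image-patch embedding, arranged on the patch grid. The sketch below illustrates only that idea, using toy vectors in place of real CLIP outputs. The function names `cosine_similarity` and `similarity_map` are hypothetical and are not taken from the authors' repository.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_map(text_embedding, patch_embeddings, grid_h, grid_w):
    """Score every patch against the text embedding and reshape to a grid.

    Assumption: patch_embeddings is a flat, row-major list of grid_h * grid_w
    vectors that live in the same embedding space as text_embedding.
    """
    if len(patch_embeddings) != grid_h * grid_w:
        raise ValueError("number of patches must equal grid_h * grid_w")
    scores = [cosine_similarity(text_embedding, p) for p in patch_embeddings]
    return [scores[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


# Toy stand-ins for CLIP embeddings: one 2-D text vector, four 2-D patch vectors.
text = [1.0, 0.0]
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
sim = similarity_map(text, patches, grid_h=2, grid_w=2)
```

In this toy setup, patches aligned with the text vector score highest, so a map of this kind could serve as a coarse localization cue for the referred class. How VLCounter actually refines and decodes such a signal into a count is described in the paper, not here.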

Code & Models

Repositories

seunggu0305/vlcounter
PyTorch · Official

Videos

VLCounter: Text-Aware Visual Representation for Zero-Shot Object Counting

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition · Multimodal Machine Learning Applications

Methods: Contrastive Language-Image Pre-training