VLCounter: Text-aware Visual Representation for Zero-Shot Object Counting
Seunggu Kang, WonJun Moon, Euiyeon Kim, Jae-Pil Heo

TL;DR
VLCounter introduces a one-stage, text-aware visual counting model that leverages CLIP embeddings and novel modules to improve zero-shot object counting accuracy and generalization without relying on error-prone two-stage pipelines.
Contribution
The paper proposes VLCounter, an end-to-end framework for zero-shot object counting built on three novel modules. It outperforms existing methods and reduces error propagation.
Findings
VLCounter achieves superior performance on the FSC147, CARPK, and PUCPR+ datasets.
The proposed modules enhance target localization and counting accuracy for unseen classes.
End-to-end training improves robustness and generalization in zero-shot counting tasks.
Abstract
Zero-Shot Object Counting (ZSOC) aims to count referred instances of arbitrary classes in a query image without human-annotated exemplars. To deal with ZSOC, preceding studies proposed a two-stage pipeline: discovering exemplars and counting. However, the sequentially designed two-stage process remains vulnerable to error propagation. In this work, a one-stage baseline, Visual-Language Baseline (VLBase), is proposed that explores the implicit association of the semantic-patch embeddings of CLIP. VLBase is then extended to Visual-language Counter (VLCounter) by incorporating three modules devised to tailor VLBase for object counting. First, Semantic-conditioned Prompt Tuning (SPT) is introduced within the image encoder to acquire target-highlighted representations. Second, Learnable Affine Transformation (LAT) is employed to translate the…
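The "implicit association of the semantic-patch embeddings" mentioned in the abstract can be pictured as a similarity map between one text embedding and a grid of image-patch embeddings. The sketch below is an illustration only, not the authors' implementation: it assumes cosine similarity as the association measure, uses toy 2-D vectors in place of real CLIP outputs, and the helper names `cosine` and `similarity_map` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_map(text_emb, patch_embs, grid_w):
    """Score every patch against the text embedding and reshape the
    flat list of scores into rows of length grid_w (hypothetical helper)."""
    sims = [cosine(text_emb, p) for p in patch_embs]
    return [sims[i:i + grid_w] for i in range(0, len(sims), grid_w)]

# Toy stand-ins for CLIP embeddings (assumption: real ones would come
# from a text encoder and an image encoder, with far more dimensions).
text_emb = [1.0, 0.0]
patch_embs = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]]

sim = similarity_map(text_emb, patch_embs, grid_w=2)
for row in sim:
    print(row)
```

Patches whose embeddings point in the same direction as the text embedding score high, so the map highlights regions likely to contain the named class. A counting model would still need a decoder to turn such a map into a density map whose sum gives the count.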
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition · Multimodal Machine Learning Applications
Methods: Contrastive Language-Image Pre-training
