Boosting Quantitive and Spatial Awareness for Zero-Shot Object Counting
Da Zhang, Bingyu Li, Feiyu Wang, Zhiyuan Zhao, Junyu Gao

TL;DR
This paper introduces QICA, a framework enhancing zero-shot object counting by integrating quantity perception and spatial aggregation, improving fine-grained reasoning and generalization across unseen categories and domains.
Contribution
QICA combines quantity perception with spatial aggregation, using a novel prompting strategy and cost aggregation decoder to improve zero-shot counting accuracy and robustness.
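The summary above does not describe how the cost aggregation decoder works internally. As a rough illustration of the general idea of a cost volume in vision-language models (not the authors' implementation), the sketch below computes cosine similarity between every image patch feature and a set of text embeddings. The function name `cosine_cost_volume`, the tensor shapes, and the random inputs are all hypothetical and chosen only for illustration.

```python
import numpy as np


def cosine_cost_volume(patch_feats, text_embeds, eps=1e-8):
    """Hypothetical helper: cosine similarity between patches and text prompts.

    patch_feats: array of shape (H, W, D), one feature vector per image patch.
    text_embeds: array of shape (K, D), one embedding per text prompt.
    Returns an array of shape (H, W, K) with values in [-1, 1].
    """
    # L2-normalize both sides so the dot product equals cosine similarity.
    p = patch_feats / (np.linalg.norm(patch_feats, axis=-1, keepdims=True) + eps)
    t = text_embeds / (np.linalg.norm(text_embeds, axis=-1, keepdims=True) + eps)
    # Contract over the feature dimension D for every (patch, prompt) pair.
    return np.einsum("hwd,kd->hwk", p, t)


# Illustrative random inputs (assumed sizes, not taken from the paper).
rng = np.random.default_rng(0)
patch_feats = rng.normal(size=(4, 4, 8))
text_embeds = rng.normal(size=(3, 8))
cost = cosine_cost_volume(patch_feats, text_embeds)
```

In a counting pipeline, a decoder would typically refine such a volume into a density map whose sum gives the predicted count; how QICA does this is not specified in this summary.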
Findings
Achieves competitive results on the FSC-147 dataset.
Demonstrates superior zero-shot generalization on CARPK and ShanghaiTech-A.
Effectively maintains numerical consistency across the pipeline.
Abstract
Zero-shot object counting (ZSOC) aims to enumerate objects of arbitrary categories specified by text descriptions without requiring visual exemplars. However, existing methods often treat counting as a coarse retrieval task, suffering from a lack of fine-grained quantity awareness. Furthermore, they frequently exhibit spatial insensitivity and degraded generalization due to feature space distortion during model adaptation. To address these challenges, we present QICA, a novel framework that synergizes quantity perception with robust spatial cost aggregation. Specifically, we introduce a Synergistic Prompting Strategy (SPS) that adapts vision and language encoders through numerically conditioned prompts, bridging the gap between semantic recognition and quantitative reasoning. To mitigate feature distortion, we propose a…
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Multimodal Machine Learning Applications
