TL;DR
This paper introduces W2-Net, a framework for referring expression counting that decouples "what to count" (object localization) from "where to see" (attribute-specific regions). It reduces counting error by 22.5% on the REC-8K dataset and improves localization F1 by 7-8%.
Contribution
The paper proposes W2-Net with a dual-query mechanism and Subclass Separable Matching, addressing the challenge of attribute-specific localization in referring expression counting.
Findings
Reduces counting error by 22.5% on the REC-8K dataset
Improves localization F1 by 7-8%
Outperforms state-of-the-art methods significantly
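To make the "reduces counting error by 22.5%" style of claim concrete, here is a minimal sketch of how a relative error reduction is computed. It assumes the counting error is a mean absolute error (MAE) over images, which this summary does not state. The function names and all counts below are hypothetical illustrations, not results from the paper.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and true counts."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def relative_reduction(baseline_error, new_error):
    """Percentage by which new_error is lower than baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error * 100.0


# Hypothetical per-image counts, made up purely for illustration.
true_counts = [10, 4, 7, 12]
baseline_counts = [13, 6, 4, 14]   # absolute errors: 3, 2, 3, 2
improved_counts = [12, 5, 5, 13]   # absolute errors: 2, 1, 2, 1

baseline_mae = mean_absolute_error(baseline_counts, true_counts)
improved_mae = mean_absolute_error(improved_counts, true_counts)
reduction = relative_reduction(baseline_mae, improved_mae)

print(f"baseline MAE={baseline_mae}, improved MAE={improved_mae}, "
      f"reduction={reduction:.1f}%")
```

Under this reading, a 22.5% reduction means the new error is 77.5% of the baseline error; it is a relative figure, not an absolute drop in the error value.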
Abstract
Referring Expression Counting (REC) extends class-level object counting to the fine-grained subclass-level, aiming to enumerate objects matching a textual expression that specifies both the class and distinguishing attribute. A fundamental challenge, however, has been overlooked: annotation points are typically placed on class-representative locations (e.g., heads), forcing models to focus on class-level features while neglecting attribute information from other visual regions (e.g., legs for "walking"). To address this, we propose W2-Net, a novel framework that explicitly decouples the problem into "what to count" and "where to see" via a dual-query mechanism. Specifically, alongside the standard what-to-count (w2c) queries that localize the object, we introduce dedicated where-to-see (w2s) queries. The w2s queries are guided to seek and extract features from attribute-specific visual…