Decoupling What to Count and Where to See for Referring Expression Counting

Yuda Zou; Zijian Zhang; Yongchao Xu

arXiv:2510.24374·cs.CV·October 29, 2025

TL;DR

This paper introduces W2-Net, a framework for referring expression counting that decouples "what to count" (localizing the objects) from "where to see" (attending to attribute-specific regions), improving counting accuracy and subclass separation.

Contribution

The paper proposes W2-Net with a dual-query mechanism and Subclass Separable Matching, addressing the challenge of attribute-specific localization in referring expression counting.

Findings

01. Reduces counting error by 22.5% on the REC-8K dataset

02. Improves localization F1 by 7-8%

03. Outperforms state-of-the-art methods significantly

Abstract

Referring Expression Counting (REC) extends class-level object counting to the fine-grained subclass level, aiming to enumerate objects matching a textual expression that specifies both the class and a distinguishing attribute. A fundamental challenge, however, has been overlooked: annotation points are typically placed on class-representative locations (e.g., heads), forcing models to focus on class-level features while neglecting attribute information from other visual regions (e.g., legs for "walking"). To address this, we propose W2-Net, a novel framework that explicitly decouples the problem into "what to count" and "where to see" via a dual-query mechanism. Specifically, alongside the standard what-to-count (w2c) queries that localize the object, we introduce dedicated where-to-see (w2s) queries. The w2s queries are guided to seek and extract features from attribute-specific visual…
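To make the dual-query idea concrete, here is a minimal NumPy sketch of two query sets attending to the same image features. It is an illustration under stated assumptions, not the authors' implementation: the abstract above is truncated, so the tensor sizes, the single-head attention, the random "learned" queries, and especially the final step that pairs and concatenates the two outputs are all hypothetical choices made only for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, features):
    """Single-head scaled dot-product cross-attention.

    queries:  (Q, D) query vectors
    features: (N, D) flattened image features
    returns:  (Q, D) attended features and (Q, N) attention weights
    """
    d = queries.shape[-1]
    weights = softmax(queries @ features.T / np.sqrt(d), axis=-1)
    return weights @ features, weights

# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
num_queries, dim, num_pixels = 4, 8, 16

# Stand-ins for a flattened feature map and two sets of learned queries.
image_features = rng.normal(size=(num_pixels, dim))
w2c_queries = rng.normal(size=(num_queries, dim))  # "what to count"
w2s_queries = rng.normal(size=(num_queries, dim))  # "where to see"

# Each query set attends to the image independently, so a w2s query
# is free to look at different regions than its w2c counterpart.
w2c_out, w2c_attn = cross_attend(w2c_queries, image_features)
w2s_out, w2s_attn = cross_attend(w2s_queries, image_features)

# ASSUMPTION: pair the i-th w2c query with the i-th w2s query and
# concatenate their features; the source does not specify the fusion.
fused = np.concatenate([w2c_out, w2s_out], axis=-1)
print(fused.shape)
```

The point of the sketch is only the decoupling: the two attention maps (`w2c_attn` and `w2s_attn`) are computed separately over the same features, so the attribute evidence does not have to come from the location where the object is annotated.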
