T2ICount: Enhancing Cross-modal Understanding for Zero-Shot Counting

Yifei Qian; Zhongliang Guo; Bowen Deng; Chun Tong Lei; Shuai Zhao; Chun Pong Lau; Xiaopeng Hong; Michael P. Pound

arXiv:2502.20625·cs.CV·October 28, 2025

TL;DR

T2ICount is a diffusion-based framework for zero-shot object counting. It adds a Hierarchical Semantic Correction Module and a Representational Regional Coherence Loss to improve text-image feature alignment and sensitivity to text prompts.

Contribution

The paper proposes T2ICount, a diffusion-based approach whose modules refine cross-modal understanding, and introduces a challenging dataset subset for evaluation.

Findings

01. Achieves superior performance on multiple benchmarks.

02. Effectively refines text-image alignment through hierarchical correction.

03. Demonstrates improved sensitivity to text prompts in counting tasks.

Abstract

Zero-shot object counting aims to count instances of arbitrary object categories specified by text descriptions. Existing methods typically rely on vision-language models like CLIP, but often exhibit limited sensitivity to text prompts. We present T2ICount, a diffusion-based framework that leverages rich prior knowledge and fine-grained visual understanding from pretrained diffusion models. While one-step denoising ensures efficiency, it leads to weakened text sensitivity. To address this challenge, we propose a Hierarchical Semantic Correction Module that progressively refines text-image feature alignment, and a Representational Regional Coherence Loss that provides reliable supervision signals by leveraging the cross-attention maps extracted from the denoising U-Net. Furthermore, we observe that current benchmarks mainly focus on majority objects in images, potentially masking models'…
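
The abstract refers to cross-attention maps extracted from the denoising U-Net. As a rough illustration only, and not the paper's implementation, the sketch below computes a generic scaled dot-product cross-attention map between image-location queries and text-token keys in plain Python. The function name `cross_attention_map`, the toy vectors, and the choice of a softmax over text tokens are all assumptions made for this example.

```python
import math

def cross_attention_map(image_queries, text_keys):
    """Return softmax(Q K^T / sqrt(d)), one row per image location.

    image_queries: list of N vectors (one per spatial location), length d each.
    text_keys: list of T vectors (one per text token), length d each.
    Hypothetical helper for illustration; not taken from the paper.
    """
    d = len(text_keys[0])
    scale = 1.0 / math.sqrt(d)
    attn = []
    for q in image_queries:
        # Similarity of this image location to every text token.
        logits = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in text_keys]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        attn.append([e / total for e in exps])
    return attn

# Toy example: 3 image locations, 2 text tokens, feature dimension 2.
queries = [[4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
attn = cross_attention_map(queries, keys)

# Per-location attention weight on text token 0, i.e. a flattened spatial map.
token0_map = [row[0] for row in attn]
```

In this toy setup, locations whose features match token 0 receive most of that token's attention weight, which is the sense in which such a map can indicate where a named object appears in the image.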
