YOLO-Count: Differentiable Object Counting for Text-to-Image Generation

Guanning Zeng; Xiang Zhang; Zirui Wang; Haiyang Xu; Zeyuan Chen; Bingnan Li; Zhuowen Tu

arXiv:2508.00728·cs.CV·August 4, 2025

TL;DR

YOLO-Count is a differentiable open-vocabulary object counting model. It estimates object counts accurately and lets text-to-image generation be steered toward a requested object quantity, relying on a novel regression target (the 'cardinality' map) and a hybrid strong-weak supervision scheme.

Contribution

The paper presents YOLO-Count, a differentiable open-vocabulary counting model built around a 'cardinality' map regression target, which enables precise quantity control in T2I generation.

Findings

01 Achieves state-of-the-art counting accuracy.

02 Enables robust quantity control in T2I systems.

03 Demonstrates effective gradient-based optimization.

Abstract

We propose YOLO-Count, a differentiable open-vocabulary object counting model that both tackles general counting challenges and enables precise quantity control for text-to-image (T2I) generation. A core contribution is the 'cardinality' map, a novel regression target that accounts for variations in object size and spatial distribution. Leveraging representation alignment and a hybrid strong-weak supervision scheme, YOLO-Count bridges the gap between open-vocabulary counting and T2I generation control. Its fully differentiable architecture facilitates gradient-based optimization, enabling accurate object count estimation and fine-grained guidance for generative models. Extensive experiments demonstrate that YOLO-Count achieves state-of-the-art counting accuracy while providing robust and effective quantity control for T2I systems.
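The abstract does not spell out how the 'cardinality' map is built or how the count gradient reaches the generator, so the following is only a toy sketch under stated assumptions, not the authors' implementation. It assumes (a) that a size-aware regression target can be formed by spreading one unit of count uniformly over each object's cells, so the map sums to the object count regardless of object size, and (b) that quantity guidance amounts to gradient descent on a squared count error. The functions `cardinality_map`, `soft_count`, and `guide_to_count`, and the sigmoid-based stand-in counter, are hypothetical names introduced for illustration.

```python
import math


def cardinality_map(height, width, instance_masks):
    """Assumed target: each instance spreads a total mass of 1 over its cells."""
    grid = [[0.0] * width for _ in range(height)]
    for cells in instance_masks:
        share = 1.0 / len(cells)  # larger objects get a smaller per-cell value
        for r, c in cells:
            grid[r][c] += share
    return grid


def soft_count(grid):
    """The count is the sum of the map, which is differentiable in every cell."""
    return sum(sum(row) for row in grid)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def guide_to_count(latents, target, lr=0.5, steps=200):
    """Toy guidance: descend (count - target)^2 with count = sum(sigmoid(z_i)).

    The sigmoid sum is a stand-in for a differentiable counter; a real
    system would backpropagate through the counting network instead.
    """
    z = list(latents)
    for _ in range(steps):
        probs = [sigmoid(v) for v in z]
        error = sum(probs) - target
        # d/dz_i of error^2 is 2 * error * p_i * (1 - p_i)
        z = [v - lr * 2.0 * error * p * (1.0 - p) for v, p in zip(z, probs)]
    return z


# Two objects of different sizes on a 4x4 grid (hypothetical data).
masks = [
    [(0, 0)],                          # small object: one cell
    [(1, 1), (1, 2), (2, 1), (2, 2)],  # larger object: four cells
]
grid = cardinality_map(4, 4, masks)
count = soft_count(grid)

# Start with six "half-present" objects and steer the soft count toward 2.
guided = guide_to_count([0.0] * 6, target=2.0)
guided_count = sum(sigmoid(v) for v in guided)
```

In this sketch both objects contribute exactly one unit to `count` even though one covers four times the area of the other, which is the size-invariance property the abstract attributes to the cardinality map. The guidance loop only illustrates why a differentiable count makes gradient-based control possible.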

Code & Models

Models

zx1239856/yolo-count (Hugging Face model)

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications