CountFormer: A Transformer Framework for Learning Visual Repetition and Structure in Class-Agnostic Object Counting

Md Tanvir Hossain, Akif Islam, and Mohd Ruhul Ameen

arXiv:2510.23785 · cs.CV · March 10, 2026


TL;DR

CountFormer swaps the image encoder of a CounTR-inspired density-regression counter for the self-supervised DINOv2 foundation model, aiming to improve structural consistency and reduce part-level overcounting in exemplar-free object counting.

Contribution

This work demonstrates that foundation-based representations can significantly improve structural consistency in exemplar-free object counting tasks.

Findings

01. Achieves competitive performance on the FSC-147 benchmark

02. Reduces part-level overcounting errors for complex objects

03. Highlights the importance of representation quality in counting accuracy

Abstract

Humans can often count unfamiliar objects by observing visual repetition and composition, rather than relying only on object categories. However, many exemplar-free counting models struggle in such situations and may overcount when objects contain symmetric components, repeated substructures, or partial occlusion. We introduce CountFormer, a controlled adaptation of a density-regression framework inspired by CounTR, where the image encoder is replaced with the self-supervised vision foundation model DINOv2. The resulting transformer features are combined with explicit two-dimensional positional embeddings and decoded by a lightweight convolutional network to produce a density map whose integral gives the final count. Our goal is not to propose a new counting architecture, but to study whether foundation-based representations improve structural consistency under a strictly exemplar-free…
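The abstract's final step, a density map whose integral gives the count, can be illustrated with a minimal sketch. This is not the authors' code: the helper names (`gaussian_kernel`, `density_map`, `count_from_density`), the kernel size, the sigma, and the object centers are all hypothetical, and the map is built by hand from known centers rather than predicted by a decoder. It only shows why summing a density map recovers a count when each object contributes unit mass.

```python
import math


def gaussian_kernel(size, sigma):
    """Return a size x size Gaussian kernel normalized to sum to 1."""
    half = size // 2
    raw = [
        [math.exp(-((x - half) ** 2 + (y - half) ** 2) / (2 * sigma ** 2))
         for x in range(size)]
        for y in range(size)
    ]
    total = sum(sum(row) for row in raw)
    return [[v / total for v in row] for row in raw]


def density_map(height, width, centers, size=7, sigma=1.5):
    """Add one unit-mass kernel per object center given as (row, col).

    Kernel mass falling outside the image is dropped, so objects near
    the border contribute slightly less than 1 to the total.
    """
    kernel = gaussian_kernel(size, sigma)
    half = size // 2
    dmap = [[0.0] * width for _ in range(height)]
    for cy, cx in centers:
        for ky in range(size):
            for kx in range(size):
                y, x = cy + ky - half, cx + kx - half
                if 0 <= y < height and 0 <= x < width:
                    dmap[y][x] += kernel[ky][kx]
    return dmap


def count_from_density(dmap):
    """The discrete 'integral' of the density map: the sum of all pixels."""
    return sum(sum(row) for row in dmap)


# Hypothetical object centers, all far enough from the border that no
# kernel mass is clipped.
centers = [(8, 8), (8, 20), (20, 14)]
dmap = density_map(32, 32, centers)
estimated = count_from_density(dmap)
print(round(estimated, 6))
```

In a trained model the map would come from the decoder rather than from known centers, so the summed value is generally a non-integer estimate of the count.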
