CountFormer: A Transformer Framework for Learning Visual Repetition and Structure in Class-Agnostic Object Counting
Md Tanvir Hossain, Akif Islam, and Mohd Ruhul Ameen

TL;DR
CountFormer replaces the image encoder of a CounTR-inspired density-regression model with the DINOv2 foundation model, testing whether stronger representations improve structural consistency and reduce part-level overcounting in exemplar-free object counting.
Contribution
This work demonstrates that foundation-based representations can significantly improve structural consistency in exemplar-free object counting tasks.
Findings
Achieves competitive performance on the FSC-147 benchmark
Reduces part-level overcounting errors for complex objects
Highlights the importance of representation quality in counting accuracy
Abstract
Humans can often count unfamiliar objects by observing visual repetition and composition, rather than relying only on object categories. However, many exemplar-free counting models struggle in such situations and may overcount when objects contain symmetric components, repeated substructures, or partial occlusion. We introduce CountFormer, a controlled adaptation of a density-regression framework inspired by CounTR, where the image encoder is replaced with the self-supervised vision foundation model DINOv2. The resulting transformer features are combined with explicit two-dimensional positional embeddings and decoded by a lightweight convolutional network to produce a density map whose integral gives the final count. Our goal is not to propose a new counting architecture, but to study whether foundation-based representations improve structural consistency under a strictly exemplar-free…
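The two mechanical pieces the abstract names, explicit two-dimensional positional embeddings over the patch grid and a count obtained by integrating the density map, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say whether the embeddings are sinusoidal or learned, so the fixed sin-cos form below is an assumption, and the function names `sincos_1d`, `sincos_2d`, and `count_from_density` are hypothetical.

```python
import math


def sincos_1d(pos, dim):
    """Embed one scalar position into `dim` channels (dim must be even)."""
    out = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        out.append(math.sin(pos * freq))
        out.append(math.cos(pos * freq))
    return out


def sincos_2d(height, width, dim):
    """Build a height x width grid of dim-channel embeddings.

    Assumption: the first half of the channels encodes the row index and
    the second half encodes the column index.
    """
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    half = dim // 2
    return [
        [sincos_1d(r, half) + sincos_1d(c, half) for c in range(width)]
        for r in range(height)
    ]


def count_from_density(density):
    """Integrate (sum) a 2D density map to obtain the predicted count."""
    return sum(sum(row) for row in density)


# Toy 2 x 3 patch grid with 8-channel positional embeddings.
pos = sincos_2d(2, 3, 8)

# Toy density map: two objects, one of them split across two cells.
density = [[0.0, 0.5, 0.5],
           [1.0, 0.0, 0.0]]
print(count_from_density(density))  # 2.0
```

In a full model, embeddings of this kind would be combined with the encoder's patch features before decoding, and the density map would come from the convolutional decoder rather than being written by hand.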
