Learning to Count without Annotations
Lukas Knobel, Tengda Han, Yuki M. Asano

TL;DR
UnCounTR is a novel unsupervised model for reference-based object counting that uses self-generated training samples called Self-Collages, eliminating the need for manual annotations and achieving competitive performance.
Contribution
This paper introduces Self-Collages and leverages unsupervised representations to enable reference-based counting without manual labels, outperforming basic baselines.
Findings
Outperforms simple baselines and generic models like FasterRCNN and DETR
Matches supervised counting models in certain domains
Demonstrates the first successful unsupervised reference-based counting
Abstract
While recent supervised methods for reference-based object counting continue to improve the performance on benchmark datasets, they have to rely on small datasets due to the cost associated with manually annotating dozens of objects in images. We propose UnCounTR, a model that can learn this task without requiring any manual annotations. To this end, we construct "Self-Collages", images with various pasted objects as training samples, that provide a rich learning signal covering arbitrary object types and counts. Our method builds on existing unsupervised representations and segmentation techniques to successfully demonstrate for the first time the ability of reference-based counting without manual supervision. Our experiments show that our method not only outperforms simple baselines and generic models such as FasterRCNN and DETR, but also matches the performance of supervised counting…
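The idea of a Self-Collage, as the abstract describes it, can be illustrated with a toy sketch: paste a known number of object crops onto a background, so that the count is known by construction and no manual label is needed. This is not the authors' implementation. The function name `make_self_collage`, the grayscale list-of-lists image format, and the density map with one unit of mass at each object's centre are all assumptions made for illustration; the paper obtains its objects through unsupervised segmentation, which is omitted here.

```python
import random


def make_self_collage(background, obj, n_objects, rng):
    """Paste `n_objects` copies of `obj` onto a copy of `background`.

    Hypothetical helper for illustration only. Images are 2D lists of floats.
    Returns (collage, density, boxes); `density` sums to `n_objects`.
    """
    H, W = len(background), len(background[0])
    h, w = len(obj), len(obj[0])
    collage = [row[:] for row in background]  # leave the input untouched
    density = [[0.0] * W for _ in range(H)]
    boxes = []
    for _ in range(n_objects):
        # randint is inclusive on both ends, so the crop always fits.
        top = rng.randint(0, H - h)
        left = rng.randint(0, W - w)
        for dy in range(h):
            collage[top + dy][left:left + w] = obj[dy]
        # Assumed target: one unit of mass at the centre of each pasted object.
        density[top + h // 2][left + w // 2] += 1.0
        boxes.append((top, left, h, w))
    return collage, density, boxes


rng = random.Random(0)
background = [[0.0] * 32 for _ in range(32)]
obj = [[1.0] * 4 for _ in range(4)]
collage, density, boxes = make_self_collage(background, obj, 5, rng)
total = sum(sum(row) for row in density)
print(total)  # 5.0
```

Later pastes may cover earlier ones in this sketch, yet the density map still counts every pasted object. The resulting collage and count form a training pair, and one of the pasted crops could serve as the reference.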
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
1. The authors propose a method to generate synthetic data for object counting and implement an unsupervised object counting approach. 2. They utilize the DINO backbone to create a counting model similar to CounTR.
1. The approach of creating synthetic data by copying segmentation results from one image to another is a well-known technique in segmentation [1]. However, this paper applies it to object counting. 2. The trained model's performance is not satisfactory, particularly in FSC-147 high, which is the primary objective of counting dense and small objects. 3. The motivation for the counting task is to abstract information from dense scenes that detection models struggle with, particularly partial and
Unsupervised counting is a challenging task, and it is appealing to see the authors propose a practical approach. The proposed method even outperforms simple baselines and generic models such as FasterRCNN and DETR.
1. The experiments are not convincing. There are two pioneering works (CrowdCLIP[1] and CSC-CCNN[2]) that also focus on the unsupervised counting task. However, the authors do not discuss or compare with them. I would like to see a comprehensive comparison. 2. The evaluated FSC-147 dataset is not very challenging. I suggest the authors conduct experiments on the crowd datasets, which are usually dense and challenging. Comparing with CrowdCLIP[1] and CSC-CCNN[2] would make the paper more solid.
1. This manuscript is sound and gives an adequate explanation of the results and experimental analysis; 2. The writing quality of this manuscript is good enough to get the points across.
1. The motivation of this work is poor. I am still confused about why we should build such a synthetic dataset from other images to get a supervision signal for training an unsupervised counter. 2. From the data perspective, the generated data are without doubt filled with artefacts, and the solution in this manuscript is just copy-paste, whose contribution is limited. 3. The counting model used in this manuscript is not entirely original; it seems to be DINO + ViT. 4. It is evident
Taxonomy
Topics: Data Management and Algorithms · Data Stream Mining Techniques · Advanced Database Systems and Queries
