SYRAC: Synthesize, Rank, and Count
Adriano D'Alessandro, Ali Mahdavi-Amiri, Ghassan Hamarneh

TL;DR
This paper introduces SYRAC, a novel crowd counting method that leverages synthetic data generated by latent diffusion models to reduce annotation effort and achieve state-of-the-art results in unsupervised crowd counting.
Contribution
It proposes a new approach that uses synthetic data from diffusion models to eliminate manual annotations in crowd counting tasks.
Findings
Achieves state-of-the-art results in unsupervised crowd counting.
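Pre-trains on ranked image pairs created by removing pedestrians from real images, which give a weak but reliable signal of object quantity.
Uses synthetic images generated for a prompted object count, which give a strong but noisy counting signal.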
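The ranked-pair pre-training mentioned in the findings can be illustrated with a pairwise ranking objective. The sketch below is an assumption for illustration only: this summary does not state the loss the authors use, so a standard margin-based hinge loss is swapped in, and the function names and the `margin` parameter are hypothetical. It encodes the single constraint that an original image should receive a count at least as high as the version of it with pedestrians removed.

```python
def pairwise_ranking_loss(count_original, count_removed, margin=0.0):
    """Hinge-style ranking loss (an assumed stand-in, not the paper's stated loss).

    Returns zero when the predicted count for the original image exceeds the
    predicted count for its pedestrian-removed counterpart by at least `margin`.
    """
    return max(0.0, margin - (count_original - count_removed))


def mean_ranking_loss(pairs, margin=0.0):
    """Average the ranking loss over (count_original, count_removed) pairs."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    total = sum(pairwise_ranking_loss(a, b, margin) for a, b in pairs)
    return total / len(pairs)


if __name__ == "__main__":
    # Hypothetical predicted counts: the first pair is ordered correctly,
    # the second pair violates the ranking and is penalized.
    predictions = [(10.0, 4.0), (3.0, 5.0)]
    print(mean_ranking_loss(predictions, margin=1.0))
```

Only the ordering of the two predictions matters here, which is why such a signal is weak: it constrains relative counts without requiring any absolute count annotation.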
Abstract
Crowd counting is a critical task in computer vision, with several important applications. However, existing counting methods rely on labor-intensive density map annotations, necessitating the manual localization of each individual pedestrian. While recent efforts have attempted to alleviate the annotation burden through weakly or semi-supervised learning, these approaches fall short of significantly reducing the workload. We propose a novel approach to eliminate the annotation burden by leveraging latent diffusion models to generate synthetic data. However, these models struggle to reliably understand object quantities, leading to noisy annotations when prompted to produce images with a specific quantity of objects. To address this, we use latent diffusion models to create two types of synthetic data: one by removing pedestrians from real images, which generates ranked image pairs with…
Peer Reviews
Decision: Submitted to ICLR 2024
The idea is novel and the experimental results demonstrate the advantages of the proposed unsupervised method.
A deep analysis of the experimental results is not provided.
- The idea of utilizing a stable model to generate synthetic images seems feasible.
- The paper introduces two strategies: a weak but reliable object quantity signal and a strong but noisy counting signal. This approach seems quite reasonable, as it can potentially complement and enhance the model's performance.
- What is the rationale behind the setting of N, which is the crowd count to generate synthetic images? What is the quality of the generated images? Is it possible to provide a measure of variance to assess the feasibility of this method?
- There are only six categories for N. Why not train the model by a classification task? In situations where the labels are not stable, the classification task seems to be able to maintain a relatively high level of accuracy.
- The synthetic images do not incl
(a) Using stable diffusion to generate the crowd dataset is a good idea, providing a new perspective for this area.
(b) This paper is well written and easy to follow.
1. For the fully supervised part, the authors only discuss density-based crowd counting methods. Localization-based methods should also be discussed to make the related work more comprehensive.
2. The authors point out that the prompt count is not reliable, yet they use it directly as the GT count during the training phase. This is confusing. I think it would be better to rank the generated 60 images using the pre-trained backbone first. Secondly, fine-tune the GT coun
Taxonomy
Topics: Video Surveillance and Tracking Methods · Human Mobility and Location-Based Analysis · Anomaly Detection Techniques and Applications
Methods: Linear Layer · Diffusion
