Counting Varying Density Crowds Through Density Guided Adaptive Selection CNN and Transformer Estimation
Yuehai Chen, Jing Yang, Badong Chen, Shaoyi Du

TL;DR
This paper introduces CTASNet, a novel crowd counting model that adaptively combines CNN and Transformer predictions based on density regions, effectively handling varying crowd densities with improved accuracy.
Contribution
The paper proposes a density guided adaptive selection network that dynamically chooses between CNN and Transformer for crowd counting, addressing density variation challenges.
Findings
Outperforms existing methods on four challenging datasets.
Effectively handles both low-density and high-density crowd regions.
Reduces annotation noise impact with a novel loss function.
Abstract
In real-world crowd counting applications, the crowd densities in an image vary greatly. When facing density variation, humans tend to locate and count the targets in low-density regions, and reason about the number in high-density regions. We observe that CNNs focus on local information correlation using a fixed-size convolution kernel, while the Transformer can effectively extract semantic crowd information using the global self-attention mechanism. Thus, a CNN can locate and estimate crowds accurately in low-density regions, but struggles to properly perceive the densities in high-density regions. In contrast, the Transformer is highly reliable in high-density regions, but fails to locate the targets in sparse regions. Neither CNN nor Transformer can deal well with this kind of density variation. To address this problem, we propose a CNN and Transformer Adaptive Selection…
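The selection idea in the abstract can be illustrated with a small sketch. This is not the CTASNet architecture, whose selection module is not described in this excerpt; it only shows one simple way to prefer a CNN density map in sparse regions and a Transformer density map in dense regions. The function name `adaptive_select`, the hard threshold, and the use of the mean of the two maps as a density guide are all assumptions made for illustration.

```python
import numpy as np

def adaptive_select(cnn_density, transformer_density, threshold=0.5):
    """Combine two density maps with a hard per-pixel mask (illustrative only)."""
    cnn_density = np.asarray(cnn_density, dtype=float)
    transformer_density = np.asarray(transformer_density, dtype=float)
    if cnn_density.shape != transformer_density.shape:
        raise ValueError("density maps must have the same shape")
    # Hypothetical density guide: mean of the two branch predictions.
    guide = 0.5 * (cnn_density + transformer_density)
    # True where a pixel is treated as belonging to a high-density region.
    high_density = guide > threshold
    # Transformer estimate in dense regions, CNN estimate in sparse regions.
    combined = np.where(high_density, transformer_density, cnn_density)
    # The crowd count is the sum of the combined density map.
    return combined, float(combined.sum())

# Toy 2x2 density maps: top row is sparse, bottom row is dense.
cnn_map = np.array([[0.1, 0.2], [0.9, 1.1]])
trans_map = np.array([[0.3, 0.0], [1.5, 1.3]])
combined_map, count = adaptive_select(cnn_map, trans_map, threshold=0.5)
```

In this toy input the top row falls below the threshold, so the CNN values are kept, while the bottom row takes the Transformer values. A learned, soft selection would replace the hard mask in any realistic model.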
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Softmax · Dropout · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Multi-Head Attention · Byte Pair Encoding · Label Smoothing
