CCTrans: Simplifying and Improving Crowd Counting with Transformer
Ye Tian, Xiangxiang Chu, Hongpeng Wang

TL;DR
CCTrans leverages a transformer-based architecture to effectively model global context in crowd counting, achieving state-of-the-art results and simplifying the traditional CNN-based pipeline.
Contribution
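Introduces CCTrans, a transformer-based crowd counting model that combines a pyramid vision transformer backbone, pyramid feature aggregation (PFA), and a regression head with multi-scale dilated convolution (MDC), and reports new state-of-the-art results on several benchmarks.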
Findings
Achieves new state-of-the-art results on multiple benchmarks.
Ranks No.1 on the NWPU-Crowd leaderboard.
Effective in both weakly and fully-supervised settings.
Abstract
Most recent crowd counting methods are based on convolutional neural networks (CNNs), which are good at extracting local features. However, CNNs inherently struggle to model global context because of their limited receptive fields, whereas transformers model global context easily. In this paper, we propose a simple approach called CCTrans to simplify the design pipeline. Specifically, we use a pyramid vision transformer backbone to capture global crowd information, a pyramid feature aggregation (PFA) model to combine low-level and high-level features, and an efficient regression head with multi-scale dilated convolution (MDC) to predict density maps. We also tailor the loss functions for our pipeline. Without bells and whistles, extensive experiments demonstrate that our method achieves new state-of-the-art results on several benchmarks both in weakly and…
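The abstract mentions a regression head built on multi-scale dilated convolution that predicts density maps. The sketch below is a toy, pure-Python 1D illustration of that idea and not the authors' implementation. The function names, the dilation rates (1, 2, 3), the averaging of branches, and the smoothing kernel are all assumptions made for illustration. It shows two things: dilation widens the receptive field without adding kernel weights, and a crowd count can be read off a density map by summing it.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Zero-padded, same-length 1D cross-correlation with a dilated kernel.

    Assumes an odd-length kernel so the output stays aligned with the input.
    """
    half = (len(kernel) // 2) * dilation
    padded = [0.0] * half + list(signal) + [0.0] * half
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            # Kernel taps are spaced `dilation` samples apart.
            acc += w * padded[i + k * dilation]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilation):
    """Number of input positions spanned by one dilated kernel."""
    return (kernel_size - 1) * dilation + 1


def multi_scale_head(features, kernel, dilations=(1, 2, 3)):
    """Hypothetical fusion: average parallel dilated branches, clamp at zero."""
    branches = [dilated_conv1d(features, kernel, d) for d in dilations]
    fused = [sum(values) / len(branches) for values in zip(*branches)]
    return [max(0.0, v) for v in fused]


# Toy "feature" row: one person near index 3, two people near index 8.
features = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0]
kernel = [0.25, 0.5, 0.25]  # sums to 1, so each branch preserves total mass

density = multi_scale_head(features, kernel)
estimated_count = sum(density)  # the count is the integral of the density map

print([receptive_field(len(kernel), d) for d in (1, 2, 3)])
print(round(estimated_count, 6))
```

In this toy setup the three branches span 3, 5, and 7 input positions, and because the kernel is normalized and the impulses sit far enough from the borders, the summed density stays close to 3.0. A real head would operate on 2D feature maps with learned weights.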
Taxonomy
Topics: Video Surveillance and Tracking Methods · Anomaly Detection Techniques and Applications · Fire Detection and Safety Systems
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Layer Normalization · Dense Connections · Dilated Convolution · Residual Connection · Vision Transformer · Convolution
