Efficient Masked AutoEncoder for Video Object Counting and A Large-Scale Benchmark
Bing Cao, Quanhao Lu, Jiekang Feng, Qilong Wang, Qinghua Hu, Pengfei Zhu

TL;DR
This paper introduces an efficient masked autoencoder framework for video object counting that leverages density maps and optical flow to improve accuracy, and presents a new large-scale bird counting dataset.
Contribution
The paper proposes a novel density-embedded masked autoencoder method with spatial adaptive masking and temporal fusion, along with a large-scale bird counting dataset for natural scenarios.
Findings
Outperforms existing methods on crowd counting datasets
Achieves high counting accuracy with density and temporal modeling
Introduces the DroneBird dataset for bird counting in natural environments
Abstract
The dynamic imbalance of the fore-background is a major challenge in video object counting, which is usually caused by the sparsity of target objects. This remains understudied in existing works and often leads to severe under-/over-prediction errors. To tackle this issue in video object counting, we propose a density-embedded Efficient Masked Autoencoder Counting (E-MAC) framework in this paper. To empower the model's representation ability on density regression, we develop a new Density-Embedded Masked mOdeling (DEMO) method, which first takes the density map as an auxiliary modality to perform multimodal self-representation learning for image and density map. Although DEMO contributes to effective cross-modal regression guidance, it also brings in redundant background information, making it difficult to focus on the…
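The abstract treats the density map as a second modality alongside the image. As background, the following is a minimal sketch of how a density map is commonly built from point annotations in counting work, and why its sum recovers the object count. It is not the authors' code: the function name `density_map`, the fixed Gaussian width `sigma`, and the example points are all assumptions made for illustration.

```python
import numpy as np

def density_map(points, height, width, sigma=2.0):
    """Build a density map from (x, y) point annotations.

    Hypothetical helper for illustration only. Each annotated object
    contributes one Gaussian blob that is normalized to sum to 1, so the
    whole map sums to the number of objects. Points are assumed to lie
    inside the image bounds.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    dmap = np.zeros((height, width), dtype=np.float64)
    for px, py in points:
        # Unnormalized Gaussian centered on the annotated point.
        g = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2.0 * sigma ** 2))
        # Normalize so this object adds exactly 1 to the total mass.
        dmap += g / g.sum()
    return dmap

# Three sparse targets in a 32x32 frame (made-up coordinates).
points = [(5, 5), (20, 12), (21, 13)]
dmap = density_map(points, height=32, width=32)

# The predicted count is the sum over the density map.
count = dmap.sum()

# Share of pixels carrying non-negligible density: a rough view of how
# sparse the foreground is relative to the background.
foreground_fraction = float((dmap > 1e-3).mean())
```

With only a few targets in the frame, most pixels hold near-zero density, which is one way to see the foreground/background imbalance the abstract describes.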
Taxonomy
Topics: Video Surveillance and Tracking Methods · Brain Tumor Detection and Classification · Advanced Image and Video Retrieval Techniques
Methods: L1 Regularization · Adaptive Masking · Focus
