Efficient Masked AutoEncoder for Video Object Counting and A Large-Scale Benchmark

Bing Cao; Quanhao Lu; Jiekang Feng; Qilong Wang; Qinghua Hu; Pengfei Zhu

arXiv:2411.13056·cs.CV·March 7, 2025

TL;DR

This paper introduces an efficient masked autoencoder framework for video object counting that leverages density maps and optical flow to improve accuracy, and presents a new large-scale bird counting dataset.

Contribution

The paper proposes a novel density-embedded masked autoencoder method with spatial adaptive masking and temporal fusion, along with a large-scale bird counting dataset for natural scenarios.

Findings

1. Outperforms existing methods on crowd datasets
2. Achieves high counting accuracy with density and temporal modeling
3. Introduces the DroneBird dataset for bird counting in natural environments

Abstract

The dynamic imbalance of the fore-background is a major challenge in video object counting, which is usually caused by the sparsity of target objects. This remains understudied in existing works and often leads to severe under-/over-prediction errors. To tackle this issue in video object counting, we propose a density-embedded Efficient Masked Autoencoder Counting (E-MAC) framework in this paper. To empower the model's representation ability on density regression, we develop a new Density-Embedded Masked mOdeling (DEMO) method, which first takes the density map as an auxiliary modality to perform multimodal self-representation learning for image and density map. Although DEMO contributes to effective cross-modal regression guidance, it also brings in redundant background information, making it difficult to focus on the…
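The abstract frames counting as density regression, where the object count is recovered by summing a density map. As a rough illustration of that general representation only (not the E-MAC or DEMO implementation), the sketch below builds a density map from point annotations by placing a normalized Gaussian at each point. The function names, kernel size, and sigma are hypothetical choices made for this example, and each point is assumed to lie inside the image.

```python
import numpy as np


def gaussian_kernel(size: int, sigma: float) -> np.ndarray:
    """Return a (size x size) Gaussian kernel that sums to 1. `size` must be odd."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()


def density_map(points, height: int, width: int,
                size: int = 15, sigma: float = 4.0) -> np.ndarray:
    """Build a density map from (row, col) point annotations.

    Illustrative assumption: every annotated object contributes exactly one
    unit of mass, so the map sums to the number of objects. Kernels that
    overlap the image border are cropped and renormalized to keep that
    property.
    """
    kernel = gaussian_kernel(size, sigma)
    half = size // 2
    dmap = np.zeros((height, width), dtype=np.float64)
    for row, col in points:
        # Image-space window covered by the kernel, clipped to the image.
        r0, r1 = max(row - half, 0), min(row + half + 1, height)
        c0, c1 = max(col - half, 0), min(col + half + 1, width)
        # Matching window inside the kernel.
        patch = kernel[r0 - (row - half): r1 - (row - half),
                       c0 - (col - half): c1 - (col - half)]
        dmap[r0:r1, c0:c1] += patch / patch.sum()
    return dmap


# Hypothetical annotations: three objects in a 32 x 48 frame.
example_points = [(0, 0), (10, 20), (31, 47)]
example_map = density_map(example_points, height=32, width=48)
estimated_count = example_map.sum()
```

Because most pixels in such a map are zero when objects are sparse, a regressor trained on it sees far more background than foreground, which is the imbalance the abstract describes.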

Taxonomy

Topics: Video Surveillance and Tracking Methods · Brain Tumor Detection and Classification · Advanced Image and Video Retrieval Techniques

Methods: L1 Regularization · Adaptive Masking · Focus