TL;DR
RACANet is a novel RGB-T crowd counting framework that explicitly models local spatial discrepancies and modality reliability, leading to improved accuracy and interpretability in complex scenes.
Contribution
The paper introduces a two-stage fusion framework with explicit local alignment and reliability modeling, enhancing cross-modal fusion for crowd counting.
Findings
RACANet outperforms existing methods on benchmark datasets.
The proposed local anchor fusion improves feature aggregation.
Reliability-aware modeling enhances counting accuracy in complex scenes.
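As a rough illustration of what position-level reliability weighting can look like, the sketch below fuses an RGB and a thermal feature map with per-position weights. It is not RACANet's actual formulation, which this summary does not give. The function name `fuse_with_reliability`, the use of raw reliability logits, and the two-way softmax are all assumptions made for illustration.

```python
import math


def fuse_with_reliability(rgb, thermal, rgb_score, thermal_score):
    """Fuse two H x W feature maps position by position (hypothetical helper).

    rgb, thermal: 2-D lists of floats holding one feature channel each.
    rgb_score, thermal_score: 2-D lists of unnormalised reliability logits
    (assumed inputs; how such scores would be predicted is not covered here).
    Returns (fused, rgb_weight), where rgb_weight[i][j] lies in [0, 1].
    """
    fused, weights = [], []
    for r_row, t_row, rs_row, ts_row in zip(rgb, thermal, rgb_score, thermal_score):
        fused_row, w_row = [], []
        for r, t, rs, ts in zip(r_row, t_row, rs_row, ts_row):
            # A softmax over two logits equals a sigmoid of their difference.
            w = 1.0 / (1.0 + math.exp(ts - rs))
            # Convex combination: the more reliable modality dominates.
            fused_row.append(w * r + (1.0 - w) * t)
            w_row.append(w)
        fused.append(fused_row)
        weights.append(w_row)
    return fused, weights


# Position 0: both modalities equally reliable. Position 1: thermal dominates.
fused, weights = fuse_with_reliability(
    rgb=[[1.0, 0.0]],
    thermal=[[3.0, 4.0]],
    rgb_score=[[0.0, -50.0]],
    thermal_score=[[0.0, 50.0]],
)
```

Because the weight map is explicit, it can be inspected directly to see which modality was trusted at each position. That is one way a reliability-aware design can be more interpretable than implicit fusion.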
Abstract
RGB-Thermal (RGB-T) crowd counting integrates visible-spectrum and thermal infrared information to make crowd density estimation more robust in complex scenes. Existing studies generally improve counting accuracy through cross-modal feature fusion, but most rely on implicit fusion strategies: they neither model local spatial discrepancies explicitly nor characterize modality reliability at the level of individual positions, which limits both the accuracy and the interpretability of the fusion process. To address these issues, this paper proposes RACANet, a Reliability-Aware Crowd Anchor Network that performs RGB-T crowd counting with a two-stage fusion framework. First, we introduce a lightweight cross-modal alignment pretraining stage, which explicitly learns cross-modal semantic correspondences through crowd-prior supervision and local…
