Robust Scene Change Detection Using Visual Foundation Models and Cross-Attention Mechanisms
Chun-Jung Lin, Sourav Garg, Tat-Jun Chin, Feras Dayoub

TL;DR
This paper introduces a scene change detection method that combines a visual foundation model with cross-attention mechanisms, achieving robustness against lighting, seasonal, and viewpoint variations, and demonstrating superior performance on benchmark datasets.
Contribution
The paper proposes a novel approach that uses a frozen backbone and full-image cross-attention for improved scene change detection, enhancing generalization and robustness.
Findings
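Significant F1-score improvements on the VL-CMU-CD and PSCD benchmarks and their viewpoint-varied versions, particularly under geometric changes between image pairs.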
Robustness against photometric and geometric variations.
Superior generalization over existing methods.
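For readers unfamiliar with the metric behind these findings, the sketch below computes an F1-score for a predicted binary change mask against a ground-truth mask. It assumes pixel-wise evaluation over flattened 0/1 masks, which this summary does not confirm; the function name `f1_score` and the toy masks are illustrative only.

```python
def f1_score(pred, target):
    """F1 = 2PR / (P + R) for two equal-length sequences of 0/1 labels.

    Assumption: masks are flattened and scored pixel by pixel.
    """
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    if tp == 0:
        # No correctly detected change pixels: define F1 as 0 to avoid 0/0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy masks (hypothetical): 1 marks a "changed" pixel.
predicted = [1, 1, 0, 0, 1, 0]
ground_truth = [1, 0, 0, 0, 1, 1]
print(f1_score(predicted, ground_truth))
```

In this toy case there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3.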
Abstract
We present a novel method for scene change detection that leverages the robust feature extraction capabilities of a visual foundational model, DINOv2, and integrates full-image cross-attention to address key challenges such as varying lighting, seasonal variations, and viewpoint differences. In order to effectively learn correspondences and mis-correspondences between an image pair for the change detection task, we propose to a) "freeze" the backbone in order to retain the generality of dense foundation features, and b) employ "full-image" cross-attention to better tackle the viewpoint variations between the image pair. We evaluate our approach on two benchmark datasets, VL-CMU-CD and PSCD, along with their viewpoint-varied versions. Our experiments demonstrate significant improvements in F1-score, particularly in scenarios involving geometric changes between image pairs. The…
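The "full-image" cross-attention idea in the abstract can be illustrated with a minimal, dependency-free sketch: every patch token of one image attends over all patch tokens of the other image, so a patch can match its counterpart even when a viewpoint shift has moved it. This is not the authors' implementation. The toy feature lists stand in for frozen backbone features (they are fixed inputs and never updated), and the learned query/key/value projections of a real attention layer are omitted for brevity; all names here are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query_tokens, context_tokens):
    """Scaled dot-product cross-attention without learned projections.

    Each query token (a patch of image A) attends over ALL context
    tokens (every patch of image B), not just a local window.
    Returns (attended_features, attention_weights).
    """
    dim = len(query_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    attended, weights = [], []
    for q in query_tokens:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in context_tokens]
        w = softmax(scores)
        # Weighted sum of context tokens, one feature channel at a time.
        attended.append([sum(wj * k[c] for wj, k in zip(w, context_tokens))
                         for c in range(dim)])
        weights.append(w)
    return attended, weights


# Toy stand-ins for frozen patch features (assumption: 2-D features).
feats_a = [[5.0, 0.0], [0.0, 5.0]]
# Same content as image A but at shifted positions, plus one extra patch.
feats_b = [[0.0, 5.0], [5.0, 0.0], [0.0, 0.0]]

attended, weights = cross_attention(feats_a, feats_b)
for i, w in enumerate(weights):
    print(f"patch {i} of A attends most to patch {w.index(max(w))} of B")
```

Because attention spans the whole of the other image, patch 0 of A finds its match at position 1 of B despite the shift, which is the intuition behind using full-image rather than local attention under viewpoint variation.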
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques
