Scene Change Detection with Vision-Language Representation Learning
Diwei Sheng, Vijayraj Gohil, Satyam Gaba, Zihan Liu, Giles Hamilton-Fletcher, John-Ross Rizzo, Yongqing Liang, Chen Feng

TL;DR
This paper introduces LangSCD, a vision-language framework for scene change detection that improves accuracy through language-based semantic reasoning, together with NYC-CD, a new large-scale dataset.
Contribution
The paper proposes a modular language component and a geometric-semantic matching module that bring semantic reasoning to scene change detection, and it introduces a new dataset with fine-grained annotations.
Findings
LangSCD achieves state-of-the-art performance on multiple benchmarks.
The semantic reasoning modules improve change detection accuracy.
The NYC-CD dataset provides detailed multiclass change annotations for urban scenes.
Abstract
Scene change detection (SCD) is crucial for urban monitoring and navigation but remains challenging in real-world environments due to lighting variations, seasonal shifts, viewpoint differences, and complex urban layouts. Existing methods rely primarily on low-level visual features, limiting their ability to accurately identify changed objects amid the visual complexity of urban scenes. In this paper, we propose LangSCD, a vision-language framework for scene change detection that overcomes this single-modal limitation by incorporating semantic reasoning through language. Our approach introduces a modular language component that leverages vision-language models (VLMs) to generate textual descriptions of scene changes, which are fused with visual features through a cross-modal feature enhancer. We further introduce a geometric-semantic matching module that refines the predicted masks by…
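To make the idea of fusing textual descriptions with visual features more concrete, here is a minimal sketch of one common way such fusion can be done: visual tokens attend to text tokens through scaled dot-product cross-attention, and the result is added back as a residual. This is an illustration under assumptions, not the paper's cross-modal feature enhancer. The function name `cross_modal_enhance`, the token shapes, and the absence of learned projections are all hypothetical choices made for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_enhance(visual, text):
    """Hypothetical fusion step (an assumption, not LangSCD's actual module).

    visual: (N, d) array of visual tokens, used as queries.
    text:   (M, d) array of text tokens, used as keys and values.
    Returns the enhanced visual tokens (N, d) and the attention map (N, M).
    """
    d = visual.shape[-1]
    scores = visual @ text.T / np.sqrt(d)   # (N, M) similarity scores
    attn = softmax(scores, axis=-1)         # each visual token's weights over text tokens
    enhanced = visual + attn @ text         # residual connection keeps the visual signal
    return enhanced, attn

# Random stand-ins for real encoder outputs.
rng = np.random.default_rng(0)
visual_tokens = rng.normal(size=(16, 32))   # e.g. 16 image patches
text_tokens = rng.normal(size=(5, 32))      # e.g. 5 words of a change description
enhanced, attn = cross_modal_enhance(visual_tokens, text_tokens)
```

A real model would typically add learned query, key, and value projections and multiple heads. The sketch omits them to keep the data flow visible.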
