Scene Change Detection with Vision-Language Representation Learning

Diwei Sheng; Vijayraj Gohil; Satyam Gaba; Zihan Liu; Giles Hamilton-Fletcher; John-Ross Rizzo; Yongqing Liang; Chen Feng

arXiv:2604.11402·cs.CV·April 14, 2026

TL;DR

This paper introduces LangSCD, a vision-language framework for scene change detection that improves accuracy by adding language-based semantic reasoning to visual features, along with NYC-CD, a new large-scale dataset.

Contribution

The paper proposes a modular language component and a geometric-semantic matching module that bring semantic reasoning to scene change detection, and introduces a new dataset, NYC-CD, with fine-grained change annotations.

Findings

1. LangSCD achieves state-of-the-art performance on multiple benchmarks.

2. The semantic reasoning modules improve change detection accuracy.

3. The NYC-CD dataset provides detailed multiclass change annotations for urban scenes.

Abstract

Scene change detection (SCD) is crucial for urban monitoring and navigation but remains challenging in real-world environments due to lighting variations, seasonal shifts, viewpoint differences, and complex urban layouts. Existing methods rely primarily on low-level visual features, limiting their ability to accurately identify changed objects amid the visual complexity of urban scenes. In this paper, we propose LangSCD, a vision-language framework for scene change detection that overcomes this single-modal limitation by incorporating semantic reasoning through language. Our approach introduces a modular language component that leverages vision-language models (VLMs) to generate textual descriptions of scene changes, which are fused with visual features through a cross-modal feature enhancer. We further introduce a geometric-semantic matching module that refines the predicted masks by…
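
The abstract says textual descriptions are fused with visual features through a cross-modal feature enhancer, but the excerpt does not say how that fusion is computed. As a rough illustration only, the sketch below assumes one common choice, scaled dot-product cross-attention with a residual connection, in which each visual token attends over the text tokens. The function name `cross_attention_fuse`, the toy token values, and the absence of learned projections are all assumptions made for this example and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(visual_tokens, text_tokens):
    """Hypothetical fusion step (not the paper's enhancer).

    Each visual token is the query; text tokens serve as both keys and
    values. The attention-weighted text context is added back to the
    visual token (residual connection). No learned weights are used.
    """
    dim = len(visual_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for v in visual_tokens:
        # Similarity between this visual token and every text token.
        scores = [scale * sum(a * b for a, b in zip(v, t)) for t in text_tokens]
        weights = softmax(scores)
        # Weighted sum of text tokens, one value per feature dimension.
        context = [
            sum(w * t[i] for w, t in zip(weights, text_tokens))
            for i in range(dim)
        ]
        fused.append([vi + ci for vi, ci in zip(v, context)])
    return fused


# Toy inputs: two visual tokens and two text tokens, each of dimension 2.
visual = [[1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0]]
fused = cross_attention_fuse(visual, text)
```

In this toy setup each visual token leans toward the text token it most resembles, so the fused feature keeps its visual content while gaining a language-derived component. A real model would typically add learned query, key, and value projections and multiple attention heads.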
