Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models
Xiaohe Li, Jiahao Li, Kaixin Zhang, Yuqiang Fang, Leilei Lin, Hong Wang, Haohua Wu, Zide Fan

TL;DR
This paper introduces Delta-LLaVA, a multimodal large language model tailored to multi-temporal remote sensing change interpretation, together with Delta-QA, a new benchmark of 180k visual question-answering samples.
Contribution
The work presents Delta-LLaVA, an MLLM framework with modules designed for multi-temporal change interpretation, and Delta-QA, a benchmark that unifies pixel-level segmentation and visual question answering across bi- and tri-temporal remote sensing scenarios.
Findings
Delta-LLaVA outperforms existing models in change deduction accuracy.
The Delta-QA benchmark enables comprehensive evaluation of multi-temporal remote sensing tasks.
Proposed modules effectively isolate and amplify visual differences across time.
Abstract
While Multimodal Large Language Models (MLLMs) excel in general vision-language tasks, their application to remote sensing change understanding is hindered by a fundamental "temporal blindness". Existing architectures lack intrinsic mechanisms for multi-temporal contrastive reasoning and struggle with precise spatial grounding. To address this, we first introduce Delta-QA, a comprehensive benchmark comprising 180k visual question-answering samples. Delta-QA unifies pixel-level segmentation and visual question answering across bi- and tri-temporal scenarios, structuring change interpretation into four progressive cognitive dimensions. Methodologically, we propose Delta-LLaVA, a novel MLLM framework explicitly tailored for multi-temporal remote sensing interpretation. It overcomes the limitations of naive feature concatenation through three core innovations: a Change-Enhanced Attention…
