Geospatial-Temporal Sensemaking of Remote Sensing Activity Detections with Multimodal Large Language Model

David F. Ramirez; Tim Overman; Kristen Jaskie; Andreas Spanias

arXiv:2605.10739·eess.IV·May 12, 2026


TL;DR

This paper introduces SMART-HC-VQA, a large-scale Sentinel-2 dataset for geospatial-temporal activity analysis using multimodal large language models, enabling reasoning about remote sensing activities over time.

Contribution

The paper builds a visual question answering (VQA) dataset from the IARPA SMART Heavy Construction annotations and pairs it with a multi-image training framework for language-guided remote sensing analysis.

Findings

01

The dataset contains 21,837 Sentinel-2 image chips, 65,511 single-image VQA examples, and approximately 2.3 million two-image temporal comparison examples.

02

Developed a multi-image MLLM training framework based on LLaVA-NeXT Mistral-7B.

03

Provides a reproducible foundation for reasoning about remote sensing activities.

Abstract

We introduce SMART-HC-VQA, a Sentinel-2-based visual question answering dataset derived from the IARPA SMART Heavy Construction dataset, designed for spatiotemporal analysis of human activity. The dataset transforms construction-site annotations, construction-type labels, temporal-phase labels, geographic metadata, and observation relationships into natural language question-answer triplets. This approach redefines the existing dataset as a temporally extended automatic target recognition and visual question answering (VQA) challenge, considering a fixed geospatial site as a target whose attributes and activity states evolve across sparse satellite observations. Currently, SMART-HC-VQA comprises 21,837 accessible Sentinel-2 image chips, 65,511 single-image VQA examples, and approximately 2.3 million two-image temporal comparison examples generated via our novel Image-Pairwise…
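To make the idea of two-image temporal comparison examples concrete, here is a minimal Python sketch of one way such pairs could be enumerated from per-site observations. It is an illustration under stated assumptions, not the authors' pipeline: the `Chip` record, its field names, the phase strings, and the question wording are all hypothetical, and the sketch simply pairs every two observations of the same site in chronological order.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from itertools import combinations


@dataclass(frozen=True)
class Chip:
    """Hypothetical record for one Sentinel-2 image chip of a fixed site."""
    site_id: str
    observed: date
    phase: str  # hypothetical temporal-phase label


def pairwise_examples(chips):
    """Build one question-answer example per pair of observations of a site.

    Assumption: every unordered pair of chips from the same site yields one
    example, so a site with n observations contributes n * (n - 1) / 2 pairs.
    """
    by_site = defaultdict(list)
    for chip in chips:
        by_site[chip.site_id].append(chip)

    examples = []
    for site_id, site_chips in by_site.items():
        # Sort by date so the first element of each pair is the earlier image.
        ordered = sorted(site_chips, key=lambda c: c.observed)
        for earlier, later in combinations(ordered, 2):
            changed = earlier.phase != later.phase
            examples.append({
                "site_id": site_id,
                "images": (earlier.observed.isoformat(),
                           later.observed.isoformat()),
                "question": "Did the construction phase change "
                            "between the two images?",
                "answer": "yes" if changed else "no",
            })
    return examples


chips = [
    Chip("site_A", date(2019, 1, 10), "site preparation"),
    Chip("site_A", date(2019, 6, 2), "active construction"),
    Chip("site_A", date(2019, 9, 15), "active construction"),
    Chip("site_B", date(2020, 3, 1), "site preparation"),
    Chip("site_B", date(2020, 3, 21), "site preparation"),
]
examples = pairwise_examples(chips)
```

Because the number of pairs grows quadratically with the number of observations per site, a scheme of this kind can turn a few tens of thousands of chips into millions of comparison examples, which is at least consistent in scale with the counts reported in the abstract.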
