Geospatial-Temporal Sensemaking of Remote Sensing Activity Detections with Multimodal Large Language Model
David F. Ramirez, Tim Overman, Kristen Jaskie, Andreas Spanias

TL;DR
This paper introduces SMART-HC-VQA, a large-scale Sentinel-2 dataset for geospatial-temporal activity analysis using multimodal large language models, enabling reasoning about remote sensing activities over time.
Contribution
The paper builds a new VQA dataset from the IARPA SMART Heavy Construction site annotations, together with a multi-image training framework for language-guided remote sensing analysis.
Findings
The dataset contains 21,837 Sentinel-2 image chips, 65,511 single-image VQA examples, and approximately 2.3 million two-image temporal comparison examples.
The authors developed a multi-image MLLM training framework based on LLaVA-NeXT Mistral-7B.
The dataset and framework provide a reproducible foundation for reasoning about remote sensing activities.
Abstract
We introduce SMART-HC-VQA, a Sentinel-2-based visual question answering dataset derived from the IARPA SMART Heavy Construction dataset, designed for spatiotemporal analysis of human activity. The dataset transforms construction-site annotations, construction-type labels, temporal-phase labels, geographic metadata, and observation relationships into natural language question-answer triplets. This approach redefines the existing dataset as a temporally extended automatic target recognition and visual question answering (VQA) challenge, considering a fixed geospatial site as a target whose attributes and activity states evolve across sparse satellite observations. Currently, SMART-HC-VQA comprises 21,837 accessible Sentinel-2 image chips, 65,511 single-image VQA examples, and approximately 2.3 million two-image temporal comparison examples generated via our novel Image-Pairwise…
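The transformation described in the abstract, from per-chip annotations to single-image and two-image question-answer examples, can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Chip` schema, the three question templates, and the all-pairs-within-a-site strategy are assumptions made for illustration only. The reported count of 65,511 single-image examples is exactly three times 21,837, which is consistent with three questions per chip, although the excerpt does not say which questions are asked.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Chip:
    """One Sentinel-2 image chip with its site annotations (hypothetical schema)."""
    site_id: str
    date: str               # ISO 8601 date, so string order equals time order
    construction_type: str
    phase: str


def single_image_examples(chip):
    """Turn one chip's labels into question-answer records (illustrative templates)."""
    return [
        {"images": [chip], "question": "What type of construction is shown?",
         "answer": chip.construction_type},
        {"images": [chip], "question": "What phase of construction is shown?",
         "answer": chip.phase},
        {"images": [chip], "question": "On what date was this image captured?",
         "answer": chip.date},
    ]


def pairwise_examples(chips):
    """Pair every two observations of the same site, earlier image first."""
    by_site = defaultdict(list)
    for chip in chips:
        by_site[chip.site_id].append(chip)

    examples = []
    for site_chips in by_site.values():
        ordered = sorted(site_chips, key=lambda c: c.date)
        # combinations() preserves input order, so each pair is (earlier, later).
        for earlier, later in combinations(ordered, 2):
            examples.append({
                "images": [earlier, later],
                "question": "Did the construction phase change between these two images?",
                "answer": "yes" if earlier.phase != later.phase else "no",
            })
    return examples
```

Under this all-pairs assumption, a site with n observations yields n(n-1)/2 two-image examples, so the number of temporal comparisons grows much faster than the number of chips.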
