TimeSenCLIP: A Time Series Vision-Language Model for Remote Sensing
Pallavi Jain, Diego Marcos, Dino Ienco, Roberto Interdonato, and Tristan Berchoux

TL;DR
TimeSenCLIP is a lightweight vision-language model designed for remote sensing time series, focusing on temporal and spectral information to improve land-use and land-cover mapping without relying on textual annotations.
Contribution
It introduces a cross-view temporal contrastive framework that aligns multispectral Sentinel-2 time series with ground imagery, emphasizing temporal and spectral signals over spatial context.
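To make the idea of cross-view contrastive alignment concrete, here is a minimal sketch of a symmetric, CLIP-style contrastive loss over paired embeddings. This summary does not state the paper's exact objective, so the loss form, the temperature value, and the function names below are assumptions for illustration only. The encoders that would produce the time-series and ground-image embeddings are omitted.

```python
import math
import random


def l2_normalize(vec):
    # Scale a vector to unit length (guarding against a zero norm).
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def log_softmax(row):
    # Numerically stable log-softmax over a list of logits.
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def symmetric_contrastive_loss(series_emb, ground_emb, temperature=0.07):
    # Hypothetical helper: row i of series_emb and row i of ground_emb are
    # assumed to come from the same location (a positive pair); every other
    # combination in the batch is treated as a negative.
    a = [l2_normalize(v) for v in series_emb]
    b = [l2_normalize(v) for v in ground_emb]
    n = len(a)
    # Cosine similarity between every time-series/ground pair, scaled by temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    columns = [list(col) for col in zip(*logits)]
    # Cross-entropy in both directions: series -> ground and ground -> series.
    loss_a2b = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    loss_b2a = -sum(log_softmax(columns[i])[i] for i in range(n)) / n
    return 0.5 * (loss_a2b + loss_b2a)


# Toy data standing in for encoder outputs (not real Sentinel-2 or ground features).
random.seed(0)
ground = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
series = [[g + 0.01 * random.gauss(0, 1) for g in row] for row in ground]

aligned = symmetric_contrastive_loss(series, ground)
mismatched = symmetric_contrastive_loss(series[::-1], ground)
```

In this toy setup, correctly paired embeddings give a lower loss than the same embeddings with the pairing reversed, which is the behaviour a contrastive alignment objective is meant to reward.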
Findings
Prior models depend on caption-based supervision, which is often unavailable or covers only limited semantics.
TimeSenCLIP effectively aligns multispectral time series with ground imagery.
Emphasizing temporal and spectral information over spatial context is better suited to medium-resolution remote sensing imagery.
Abstract
Vision-language models (VLMs) have shown significant promise in remote sensing applications, particularly for land-use and land-cover (LULC) mapping via zero-shot classification and retrieval. However, current approaches face several key challenges. They depend on caption-based supervision, which is often unavailable or covers only limited semantics, and they are adapted from generic VLM architectures suited to very-high-resolution images. Consequently, these models tend to prioritize spatial context over spectral and temporal information, limiting their effectiveness for medium-resolution remote sensing imagery. In this work, we present TimeSenCLIP, a lightweight VLM for remote sensing time series, using a cross-view temporal contrastive framework to align multispectral Sentinel-2 time series with geo-tagged ground-level imagery,…
