A Vision-Language Framework for Multispectral Scene Representation Using Language-Grounded Features

Enes Karanfil; Nevrez Imamoglu; Erkut Erdem; Aykut Erdem

arXiv:2501.10144 · cs.CV · January 20, 2025

TL;DR

This paper introduces Spectral LLaVA, a novel vision-language framework that integrates multispectral remote sensing data with language grounding to improve scene understanding and description accuracy in complex environments.

Contribution

The paper presents a multispectral vision-language framework for scene representation and description in remote sensing. It trains only a lightweight linear projection layer for alignment and keeps the SpectralGPT vision backbone frozen.

Findings

01 Improved scene description accuracy with multispectral data

02 Enhanced classification performance through semantic feature refinement

03 Effective alignment of multispectral features with language models

Abstract

Scene understanding in remote sensing often faces challenges in generating accurate representations for complex environments such as various land use areas or coastal regions, which may also include snow, clouds, or haze. To address this, we present a vision-language framework named Spectral LLaVA, which integrates multispectral data with vision-language alignment techniques to enhance scene representation and description. Using the BigEarthNet v2 dataset from Sentinel-2, we establish a baseline with RGB-based scene descriptions and further demonstrate substantial improvements through the incorporation of multispectral information. Our framework optimizes a lightweight linear projection layer for alignment while keeping the vision backbone of SpectralGPT frozen. Our experiments encompass scene classification using linear probing and language modeling for jointly performing scene…
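The alignment setup described in the abstract (a frozen vision backbone with only a linear projection layer being optimized) can be illustrated with a toy sketch. This is not the authors' implementation: the backbone here is a fixed random encoder standing in for SpectralGPT, the mean-squared-error objective is a placeholder for a real language-modeling loss, and every size and name (`NUM_BANDS`, `VISION_DIM`, `LLM_DIM`, `frozen_backbone`, `project`) is a hypothetical choice made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration (not taken from the paper).
NUM_BANDS, PATCH, VISION_DIM, LLM_DIM, NUM_TOKENS = 12, 8, 32, 64, 16
PATCH_DIM = NUM_BANDS * PATCH * PATCH

# Stand-in for a frozen vision backbone: a fixed random patch encoder.
backbone_weights = rng.normal(size=(PATCH_DIM, VISION_DIM)) * 0.05


def frozen_backbone(patches):
    """Encode flattened multispectral patches; these weights are never updated."""
    return np.tanh(patches @ backbone_weights)


# The only trainable parameters: a linear projection into the
# (assumed) embedding space of the language model.
proj_w = np.zeros((VISION_DIM, LLM_DIM))
proj_b = np.zeros(LLM_DIM)


def project(tokens):
    """Map vision tokens to language-model-sized embeddings."""
    return tokens @ proj_w + proj_b


# Toy data: random patches and random target embeddings. The targets are a
# placeholder for the supervision a real language-modeling loss would provide.
patches = rng.normal(size=(NUM_TOKENS, PATCH_DIM))
targets = rng.normal(size=(NUM_TOKENS, LLM_DIM))

backbone_before = backbone_weights.copy()
tokens = frozen_backbone(patches)  # computed once, since the backbone is frozen

losses = []
learning_rate = 0.5
for _ in range(200):
    err = project(tokens) - targets
    losses.append(float(np.mean(err ** 2)))
    # Gradients of the mean-squared error with respect to the projection only.
    grad_w = tokens.T @ err * (2.0 / err.size)
    grad_b = err.sum(axis=0) * (2.0 / err.size)
    proj_w -= learning_rate * grad_w
    proj_b -= learning_rate * grad_b

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Because the backbone output does not change during training, the vision tokens can be computed once and reused, which is part of what makes this kind of alignment lightweight.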

Taxonomy

Topics: Multimodal Machine Learning Applications