Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery
Minh Kha Do, Wei Xiang, Kang Han, Di Wu, Khoa Phan, Yi-Ping Phoebe Chen, Gaowen Liu, Ramana Rao Kompella

TL;DR
SATtxt introduces a spectrum-aware vision-language model for satellite imagery that learns spectral cues during training and operates with RGB inputs at inference, enhancing zero-shot classification and retrieval performance.
Contribution
The paper proposes a two-stage framework for satellite imagery: spectral representation distillation, followed by spectrally grounded alignment with instruction-augmented LLMs.
Findings
Improves zero-shot classification by 4.2% on average
Enhances retrieval accuracy by 5.9%
Boosts linear probing performance by 2.7%
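For readers unfamiliar with the zero-shot setting behind these numbers, the sketch below shows how CLIP-style models commonly score it: an image embedding is compared against text embeddings of class prompts by cosine similarity, and the best match is the prediction. Whether SATtxt scores it exactly this way is an assumption; the function name `zero_shot_classify` and the toy embeddings are hypothetical and are not taken from the paper.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def zero_shot_classify(image_embedding, class_text_embeddings):
    """Return the best-matching label and the per-class cosine similarities.

    Hypothetical helper: image_embedding stands in for the output of an
    RGB image encoder, class_text_embeddings for text-encoder outputs of
    prompts such as "a satellite image of a forest".
    """
    image_unit = normalize(image_embedding)
    scores = {
        label: sum(a * b for a, b in zip(image_unit, normalize(text_embedding)))
        for label, text_embedding in class_text_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy embeddings, made up purely for illustration.
classes = {
    "forest": [0.9, 0.1, 0.0],
    "river": [0.0, 0.2, 0.9],
}
label, scores = zero_shot_classify([0.8, 0.3, 0.1], classes)
print(label, scores)
```

Retrieval can be scored with the same similarity: instead of taking the single best class, candidates are ranked by their cosine score against the query.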
Abstract
Vision-language foundation models (VLFMs) promise zero-shot and retrieval understanding for Earth observation. While operational satellite systems often lack full multi-spectral coverage, making RGB-only inference highly desirable for scalable deployment, the adoption of VLFMs for satellite imagery remains hindered by two factors: (1) multi-spectral inputs are informative but difficult to exploit consistently due to band redundancy and misalignment; and (2) CLIP-style text encoders limit semantic expressiveness and weaken fine-grained alignment. We present SATtxt, a spectrum-aware VLFM that operates with RGB inputs only at inference while retaining spectral cues learned during training. Our framework comprises two stages. First, Spectral Representation Distillation transfers spectral priors from a frozen multi-spectral teacher to an RGB student via a lightweight projector. Second,…
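The first stage described in the abstract (a frozen multi-spectral teacher, an RGB student, and a lightweight projector) can be illustrated with a minimal sketch. The abstract does not state the loss or the projector architecture, so the linear projector and the cosine-distance objective below are assumptions chosen for simplicity, and all names (`project`, `distillation_loss`) are hypothetical.

```python
import math


def project(student_feature, weight, bias):
    """Assumed linear projector: maps an RGB student feature into the
    teacher's feature space (one weight row per output dimension)."""
    return [
        sum(w * x for w, x in zip(row, student_feature)) + b
        for row, b in zip(weight, bias)
    ]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def distillation_loss(student_feature, teacher_feature, weight, bias):
    """Assumed objective: 1 - cosine similarity between the projected
    student feature and the teacher feature. The teacher is frozen, so
    its feature is treated as a fixed target; only the student and the
    projector would receive gradients in a real training loop."""
    projected = project(student_feature, weight, bias)
    return 1.0 - cosine_similarity(projected, teacher_feature)


# Toy example: a 2-D student feature projected into a 3-D teacher space.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]
aligned = distillation_loss([1.0, 0.0], [1.0, 0.0, 1.0], weight, bias)
orthogonal = distillation_loss([1.0, 0.0], [0.0, 1.0, 0.0], weight, bias)
print(aligned, orthogonal)
```

In this toy setup the loss is near zero when the projected student feature points in the same direction as the teacher feature, and near one when the two are orthogonal. At inference the teacher and its multi-spectral input are no longer needed, which is consistent with the RGB-only deployment the abstract describes.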
Taxonomy
Topics: Remote-Sensing Image Classification · Advanced Neural Network Applications · Multimodal Machine Learning Applications
