TL;DR
SpectraDINO is a multispectral vision foundation model that extends RGB-pretrained DINOv2 ViT backbones to NIR, SWIR, and LWIR imagery using lightweight per-modality adapters and a multi-stage training protocol, and it reports state-of-the-art results on multispectral benchmarks.
Contribution
The paper combines lightweight, per-modality bottleneck adapters with a multi-stage teacher-student distillation protocol to adapt RGB-pretrained models to multispectral vision tasks.
Findings
SpectraDINO outperforms existing methods on multispectral object detection and segmentation benchmarks.
The model effectively bridges the spectral gap while preserving RGB priors.
The authors report state-of-the-art performance across multiple multispectral benchmarks.
Abstract
Vision Foundation Models (VFMs) pretrained on large-scale RGB data have demonstrated remarkable representation quality, yet their applicability to multispectral imaging spanning Near-Infrared (NIR), Short-Wave Infrared (SWIR), and Long-Wave Infrared (LWIR) remains largely unexplored. These spectral modalities offer complementary sensing capabilities critical for robust perception in adverse conditions, but present a fundamental domain gap relative to RGB-centric pretrained models. We present SpectraDINO, a multispectral VFM that bridges this spectral gap by extending DINOv2 ViT backbones to beyond-visible modalities through lightweight, per-modality bottleneck adapters, while preserving the rich representations of the frozen RGB backbone. We introduce a multi-stage teacher-student training protocol in which a frozen DINOv2 teacher guides a spectral student via cosine distillation,…
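The two mechanisms named in the abstract, a per-modality bottleneck adapter and a cosine distillation objective, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class name BottleneckAdapter, the residual ReLU layout, the zero-initialised up-projection, the sizes (384 and 48), and the random arrays standing in for frozen-teacher features are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


class BottleneckAdapter:
    """Hypothetical residual bottleneck: x + up(relu(down(x)))."""

    def __init__(self, dim, bottleneck, rng):
        # Down-projection to a narrow hidden width (assumed small random init).
        self.w_down = rng.normal(0.0, 0.02, size=(dim, bottleneck))
        # Assumption: zero-initialised up-projection, so the adapter starts
        # as an identity map and the backbone features pass through unchanged.
        self.w_up = np.zeros((bottleneck, dim))

    def __call__(self, x):
        hidden = np.maximum(x @ self.w_down, 0.0)  # ReLU in the bottleneck
        return x + hidden @ self.w_up              # residual connection


def cosine_distillation_loss(student, teacher, eps=1e-8):
    """Mean of (1 - cosine similarity) over all token feature vectors."""
    s = student / (np.linalg.norm(student, axis=-1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))


# One adapter per spectral modality; sizes are illustrative only.
dim, bottleneck = 384, 48
adapters = {m: BottleneckAdapter(dim, bottleneck, rng) for m in ("nir", "swir", "lwir")}

# Random stand-ins for features: (batch, tokens, dim).
teacher_tokens = rng.normal(size=(4, 16, dim))
student_tokens = adapters["nir"](teacher_tokens)

loss = cosine_distillation_loss(student_tokens, teacher_tokens)
```

In this sketch only the adapter weights would be trainable, which mirrors the abstract's description of a frozen RGB backbone. With the assumed zero initialisation the student features equal the teacher features at the start, so the loss begins near zero and grows only as the adapter departs from the teacher's representation.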
