TL;DR
This paper introduces CAFe-DINO, a zero-shot open-vocabulary semantic segmentation (OVSS) model for remote sensing (RS) imagery that builds on the DINOv3 backbone and needs no RS-specific fine-tuning.
Contribution
It combines cost aggregation with training-free upsampling of text-image similarity scores on top of DINOv3, achieving state-of-the-art results without RS-specific fine-tuning.
Findings
CAFe-DINO outperforms fine-tuned OVSS methods on RS datasets.
DINOv3's backbone enables effective zero-shot RS segmentation.
The model is trained on a subset of COCO-Stuff and performs well on RS imagery.
Abstract
The remote sensing (RS) domain suffers from a lack of densely labeled datasets, which are costly to obtain. Thus, models that can segment RS imagery well without supervised fine-tuning are valuable, but existing solutions fall behind supervised methods. Recently, DINOv3 surpassed SOTA RS foundation models on the GEO-bench segmentation benchmark without pre-training on RS data. Additionally, DINO.txt has enabled open-vocabulary semantic segmentation (OVSS) with the DINOv3 backbone. We leverage these developments to form an OVSS model for RS imagery, free of RS-domain fine-tuning. Our model, CAFe-DINO (Cost Aggregation + Feature Upsampling with DINO), exploits the strong OVSS performance of DINOv3 for RS imagery via cost aggregation and training-free upsampling of text-image similarity scores. The robust latent of the DINOv3 backbone eliminates the need for fine-tuning on RS imagery; we…
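Illustrative sketch
The abstract describes scoring image features against text embeddings and upsampling those similarity scores without training. The NumPy sketch below illustrates that general idea only and is not the authors' implementation. It builds a per-class cosine-similarity map (the raw cost volume), upsamples it with plain bilinear interpolation as a stand-in for the paper's training-free upsampler, and takes the per-pixel argmax. The learned cost aggregation stage is omitted, and all function names, shapes, and the choice of bilinear interpolation are assumptions made for illustration.

```python
import numpy as np


def cosine_similarity_map(patch_feats, text_feats):
    """Cosine similarity between every patch and every class embedding.

    patch_feats: (H, W, D) patch features (hypothetical backbone output).
    text_feats:  (C, D) class text embeddings (hypothetical text encoder output).
    Returns a (C, H, W) similarity volume.
    """
    p = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    return np.einsum("hwd,cd->chw", p, t)


def bilinear_upsample(scores, out_h, out_w):
    """Bilinear interpolation of a (C, H, W) score volume to (C, out_h, out_w).

    Assumption: corner-aligned sampling, used here only as a simple
    training-free upsampler; the paper's upsampler may differ.
    """
    _, h, w = scores.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]  # vertical interpolation weights
    wx = (xs - x0)[None, None, :]  # horizontal interpolation weights
    top = scores[:, y0][:, :, x0] * (1 - wx) + scores[:, y0][:, :, x1] * wx
    bottom = scores[:, y1][:, :, x0] * (1 - wx) + scores[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bottom * wy


def segment(patch_feats, text_feats, out_h, out_w):
    """Return an (out_h, out_w) map of class indices."""
    sim = cosine_similarity_map(patch_feats, text_feats)
    return bilinear_upsample(sim, out_h, out_w).argmax(axis=0)


# Toy example: a 2x2 patch grid with 2-D features and two class embeddings.
patches = np.array([[[1.0, 0.0], [1.0, 0.0]],
                    [[0.0, 1.0], [0.0, 1.0]]])
classes = np.array([[1.0, 0.0],
                    [0.0, 1.0]])
labels = segment(patches, classes, 4, 4)
```

In the toy example the top half of the upsampled map is assigned to class 0 and the bottom half to class 1. Upsampling the scores before taking the argmax, rather than upsampling the label map, lets class boundaries fall between patch centers.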
