TL;DR
ConInfer introduces a context-aware inference framework that jointly predicts across spatial units and models their semantic dependencies, significantly improving open-vocabulary remote sensing segmentation accuracy.
Contribution
It is the first to explicitly incorporate global context and inter-unit dependencies in training-free remote sensing segmentation, enhancing consistency and robustness.
Findings
Outperforms state-of-the-art baselines by 2.80% and 6.13% on benchmark datasets.
Enhances segmentation consistency and generalization in complex environments.
Demonstrates effectiveness across multiple remote sensing tasks.
Abstract
Training-free open-vocabulary remote sensing segmentation (OVRSS), empowered by vision-language models, has emerged as a promising paradigm for achieving category-agnostic semantic understanding in remote sensing imagery. Existing approaches mainly focus on enhancing feature representations or mitigating modality discrepancies to improve patch-level prediction accuracy. However, such independent prediction schemes are fundamentally misaligned with the intrinsic characteristics of remote sensing data. In real-world applications, remote sensing scenes are typically large-scale and exhibit strong spatial as well as semantic correlations, making isolated patch-wise predictions insufficient for accurate segmentation. To address this limitation, we propose ConInfer, a context-aware inference framework for OVRSS that performs joint prediction across multiple spatial units while explicitly…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
