SkyNative: A Native Multimodal Framework for Remote Sensing Visual Evidence Reasoning

Xiao Yang; Ronghao Fu; Zhiwen Lin; Zhuoran Duan; Jiashun Zhu; Jiasen Hu; Lang Sun; Weipeng Zhang; Jiaqi Liu; Xu Na; Haoran Liu; Weijie Zhang; Bo Yang

arXiv:2605.17949 · cs.CV · May 19, 2026

TL;DR

SkyNative is an encoder-free, native multimodal framework for remote sensing. It represents images directly as raw patch tokens in the language-model token space, which improves fine-grained spatial reasoning and robustness.

Contribution

SkyNative proposes an encoder-free architecture with modality-aware decoupling for remote sensing vision-language tasks, improving image-grounded perception and robustness.

Findings

01 SkyNative outperforms traditional models on remote sensing understanding tasks.

02 SkyNative is more robust to misleading prompts and language priors.

03 On the visual reliance benchmark, SkyNative shows improved grounding in image evidence.

Abstract

Remote sensing vision-language models commonly rely on pretrained visual encoders to convert images into semantic features before language-model reasoning. While effective for scene-level understanding, this pipeline may prematurely compress local visual evidence, making fine-grained spatial reasoning vulnerable to language priors, especially in ultra-high-resolution remote sensing imagery. We present SkyNative, a native multimodal framework for remote sensing that adopts an encoder-free architecture, removing the pretrained visual backbone to directly represent images as raw patch tokens in the language-model token space. To reconcile low-level visual patches with textual tokens, SkyNative introduces a modality-aware decoupling mechanism that uses modality-specific parameters within a unified autoregressive backbone. We further introduce a visual reliance benchmark that diagnoses…
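
The abstract gives no implementation details, so the following is only a minimal sketch of the two ideas it names: cutting an image into raw patch tokens without a visual encoder, and routing tokens through modality-specific parameters inside one shared sequence. The function names (`patchify`, `modality_aware_linear`), the patch size, the model width, and the use of a single linear layer per modality are all assumptions made for illustration, not the authors' method.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping raw patches."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be multiples of patch")
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)

def modality_aware_linear(tokens, is_visual, w_text, w_visual):
    """Apply a different weight matrix to visual and text tokens (hypothetical)."""
    out = np.empty((tokens.shape[0], w_text.shape[1]), dtype=tokens.dtype)
    out[is_visual] = tokens[is_visual] @ w_visual
    out[~is_visual] = tokens[~is_visual] @ w_text
    return out

# Illustrative sizes only; none of these values come from the paper.
rng = np.random.default_rng(0)
d_model, patch = 64, 8
image = rng.random((32, 32, 3))            # stand-in for an image tile
text_tokens = rng.random((5, d_model))     # stand-in for text embeddings

patches = patchify(image, patch)           # (16, 192) raw pixel patches
w_embed = rng.random((patches.shape[1], d_model))
visual_tokens = patches @ w_embed          # linear projection, no encoder

# One shared sequence, with a mask recording each token's modality.
sequence = np.concatenate([visual_tokens, text_tokens], axis=0)
is_visual = np.arange(sequence.shape[0]) < visual_tokens.shape[0]

w_text = rng.random((d_model, d_model))
w_visual = rng.random((d_model, d_model))
hidden = modality_aware_linear(sequence, is_visual, w_text, w_visual)
```

In this sketch both modalities share one sequence, and only the weights applied to each token differ; a real model would apply such routing inside attention and feed-forward layers rather than in a single projection.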
