Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model
Di Wang, Qiming Zhang, Yufei Xu, Jing Zhang, Bo Du, Dacheng Tao, and Liangpei Zhang

TL;DR
This paper introduces a large vision transformer model tailored for remote sensing tasks, utilizing a novel rotated varied-size window attention mechanism to improve efficiency and performance across detection, classification, and segmentation tasks.
Contribution
The paper proposes a large-scale plain vision transformer with a new rotated varied-size window attention for remote sensing, demonstrating superior performance and efficiency over existing models.
Findings
Achieved 81.24% mAP on the DOTA-V1.0 detection dataset.
Outperformed state-of-the-art models in remote sensing detection tasks.
Showed competitive results in classification and segmentation tasks.
Abstract
Large-scale vision foundation models have made significant progress in visual tasks on natural images, with vision transformers being the primary choice due to their good scalability and representation ability. However, large-scale models in remote sensing (RS) have not yet been sufficiently explored. In this paper, we resort to plain vision transformers with about 100 million parameters and make the first attempt to propose large vision models tailored to RS tasks and investigate how such large models perform. To handle the large sizes and objects of arbitrary orientations in RS images, we propose a new rotated varied-size window attention to replace the original full attention in transformers, which can significantly reduce the computational cost and memory footprint while learning better object representation by extracting rich context from the generated diverse windows. Experiments…
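The abstract states that rotated varied-size window attention replaces full attention to cut computation while drawing context from diverse windows, but the excerpt gives no formulas. The following is a minimal sketch of the two ideas under stated assumptions, not the authors' implementation: `attention_pairs` counts query-key pairs for full versus windowed attention, and `rotated_window_points` builds the sampling grid of one window that has been scaled, shifted, and rotated. Both function names, and the parameters `scale`, `offset`, and `angle`, are hypothetical and chosen for illustration; in a real model such values would presumably be predicted per window by the network.

```python
import math


def attention_pairs(num_tokens, window_tokens=None):
    """Count query-key pairs scored by one attention head.

    Full attention compares every token with every token; window
    attention compares each token only with the tokens of its window.
    """
    if window_tokens is None:
        return num_tokens * num_tokens
    return num_tokens * window_tokens


def rotated_window_points(center, size, scale=(1.0, 1.0),
                          offset=(0.0, 0.0), angle=0.0, n=2):
    """Return n*n (x, y) sampling points for one transformed window.

    Assumption: the window is a square of side `size` around `center`,
    stretched by `scale`, shifted by `offset`, and rotated by `angle`
    (radians) about its shifted centre.
    """
    cx, cy = center
    half_w = 0.5 * size * scale[0]
    half_h = 0.5 * size * scale[1]
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    points = []
    for i in range(n):
        for j in range(n):
            # Cell-centre coordinates normalised to (-1, 1).
            u = (2 * j + 1) / n - 1
            v = (2 * i + 1) / n - 1
            dx, dy = u * half_w, v * half_h
            # Standard 2D rotation followed by translation.
            x = cx + offset[0] + cos_a * dx - sin_a * dy
            y = cy + offset[1] + sin_a * dx + cos_a * dy
            points.append((x, y))
    return points


if __name__ == "__main__":
    tokens = 64 * 64   # a 64 x 64 token map
    window = 8 * 8     # 8 x 8 tokens per window
    print(attention_pairs(tokens) // attention_pairs(tokens, window))
    print(rotated_window_points((0.0, 0.0), 4.0, angle=math.pi / 4))
```

With these illustrative sizes, windowed attention scores 64 times fewer query-key pairs than full attention, which is the kind of saving the abstract refers to. Rotating and resizing the sampling grid lets a window cover an object that is not axis-aligned, at the same number of sampled points.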
Taxonomy
Topics: Remote-Sensing Image Classification · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications
