An Efficient and Effective Encoder Model for Vision and Language Tasks in the Remote Sensing Domain

João Daniel Silva; Joao Magalhaes; Devis Tuia; Bruno Martins

arXiv:2512.15531·cs.CV·December 18, 2025

TL;DR

This paper introduces GeoMELT, a compact encoder-only model designed for multi-task vision and language applications in remote sensing, achieving effective performance with reduced computational costs.

Contribution

The paper proposes a novel encoder-only architecture for multi-task vision and language learning in remote sensing, offering an efficient yet effective alternative to large vision and language models.

Findings

01

GeoMELT outperforms existing models on benchmark tasks

02

The model is more parameter-efficient and less computationally demanding than large vision and language models

03

The model is effective in tasks such as image captioning and cross-modal retrieval

Abstract

The remote sensing community has recently seen the emergence of methods based on Large Vision and Language Models (LVLMs) that can address multiple tasks at the intersection of computer vision and natural language processing. To fully exploit the potential of such models, a significant focus has been given to the collection of large amounts of training data that cover multiple remote sensing-specific tasks, such as image captioning or visual question answering. However, the cost of using and training LVLMs is high, due to the large number of parameters. While multiple parameter-efficient adaptation techniques have been explored, the computational costs of training and inference with these models can remain prohibitive for most institutions. In this work, we explore the use of encoder-only architectures and propose a model that can effectively address multi-task learning while remaining…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning