An Efficient and Effective Encoder Model for Vision and Language Tasks in the Remote Sensing Domain
João Daniel Silva, Joao Magalhaes, Devis Tuia, Bruno Martins

TL;DR
This paper introduces GeoMELT, a compact encoder-only model designed for multi-task vision and language applications in remote sensing, achieving effective performance with reduced computational costs.
Contribution
The paper proposes a novel encoder-only architecture for multi-task vision and language learning in remote sensing, prioritizing efficiency alongside effectiveness rather than relying on traditional large models.
Findings
GeoMELT outperforms existing models on benchmark tasks.
The model is more parameter-efficient and less computationally demanding than approaches based on Large Vision and Language Models (LVLMs).
It is effective on tasks such as image captioning and cross-modal retrieval.
Abstract
The remote sensing community has recently seen the emergence of methods based on Large Vision and Language Models (LVLMs) that can address multiple tasks at the intersection of computer vision and natural language processing. To fully exploit the potential of such models, a significant focus has been given to the collection of large amounts of training data that cover multiple remote sensing-specific tasks, such as image captioning or visual question answering. However, the cost of using and training LVLMs is high, due to the large number of parameters. While multiple parameter-efficient adaptation techniques have been explored, the computational costs of training and inference with these models can remain prohibitive for most institutions. In this work, we explore the use of encoder-only architectures and propose a model that can effectively address multi-task learning while remaining…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
