AVION: Aerial Vision-Language Instruction from Offline Teacher to Prompt-Tuned Network
Yu Hu, Jianyang Gu, Hao Liu, Yue Cao, Jozsef Hamari, Zheng Liu, Mohsen Zardadi

TL;DR
AVION is a knowledge distillation framework that adapts vision-language models to remote sensing imagery, improving classification and retrieval by combining semantically rich textual prototypes with prompt tuning.
Contribution
Introduces a framework that combines semantically rich textual prototypes with prompt tuning to adapt vision-language models to remote sensing imagery.
Findings
Improves few-shot classification accuracy.
Enhances cross-modal retrieval mean recall.
Maintains generalization to novel categories.
Abstract
Adapting vision-language models to remote sensing imagery remains challenging due to two key factors: limited semantic coverage in textual representations and insufficient adaptability of visual features. These issues are particularly significant in aerial scenes, which involve various visual appearances and fine-grained object distinctions. We propose AVION, a knowledge distillation framework tailored for remote sensing adaptation of vision-language models. The teacher module constructs semantically rich textual prototypes by collecting descriptions from a large language model and verifying validity using remote sensing image features. The student module integrates lightweight and learnable prompts into both vision and language encoders, guided by the teacher to align embeddings and their cross-modal relationships. Once trained, the student operates independently during inference.…
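The teacher and student roles described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not give the filtering rule or the loss, so everything below is an assumption made for illustration. The function names (`build_prototype`, `distill_loss`), the `keep_ratio` filter, the cosine embedding term, the KL relation term, and the temperature `tau` are all hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def build_prototype(desc_emb, img_emb, keep_ratio=0.5):
    """Teacher side (assumed rule): keep the LLM description embeddings that are
    most similar to the class's mean image feature, then average the kept ones
    into a single textual prototype."""
    desc_emb = l2_normalize(desc_emb)
    img_center = l2_normalize(l2_normalize(img_emb).mean(axis=0))
    scores = desc_emb @ img_center                    # cosine similarity per description
    k = max(1, int(round(keep_ratio * len(desc_emb))))
    keep = np.argsort(-scores)[:k]                    # indices of the top-k descriptions
    return l2_normalize(desc_emb[keep].mean(axis=0)), keep

def distill_loss(s_img, s_txt, t_img, t_txt, tau=0.07):
    """Student side (assumed loss): align student embeddings with the teacher's,
    and align the image-to-text similarity distributions of the two models."""
    s_img, s_txt, t_img, t_txt = map(l2_normalize, (s_img, s_txt, t_img, t_txt))
    # Embedding alignment: mean (1 - cosine similarity) for each modality.
    emb = (1 - (s_img * t_img).sum(-1)).mean() + (1 - (s_txt * t_txt).sum(-1)).mean()
    # Relation alignment: KL(teacher || student) over rows of the similarity matrix.
    p_t = softmax(t_img @ t_txt.T / tau)
    p_s = softmax(s_img @ s_txt.T / tau)
    rel = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(-1).mean()
    return emb + rel
```

In this sketch the teacher outputs are treated as fixed targets, which mirrors the abstract's statement that the student operates independently once trained. In a real training loop only the student's prompt parameters would receive gradients, which requires an autodiff framework rather than plain NumPy.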
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
