Toward Vision-Language Assistants for Radio Astronomical Source Analysis
S. Riggi

TL;DR
This paper evaluates vision-language models (VLMs) on radio astronomy tasks, introduces radio-llava, a domain-adapted assistant, and analyzes the models' performance and limitations in scientific source analysis.
Contribution
It presents radio-llava, a multimodal assistant built on the LLaVA architecture and instruction fine-tuned for radio astronomy, and evaluates open-weight and commercial VLMs on six tasks in this domain.
Findings
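In zero-shot mode, commercial models such as GPT-4.1 outperform open-weight VLMs on most radio benchmarks.
Fine-tuned radio-llava significantly improves on both base LLaVA and commercial models across nearly all tasks, although specialized vision-only models still perform substantially better.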
Fine-tuning causes catastrophic forgetting, reducing general multimodal performance.
Abstract
Vision-language models (VLMs) have recently shown promise in general-purpose reasoning tasks, yet their applicability to domain-specific scientific workflows remains largely unexplored. In this work, we evaluated a series of open-weight and commercial VLMs on six tasks relevant to radio astronomy, such as source morphology classification. We also introduced radio-llava, a fine-tuned multimodal assistant built on the LLaVA architecture and adapted for the radio domain through instruction fine-tuning. In zero-shot mode, commercial models like GPT-4.1 outperform open-weight VLMs on most radio benchmarks. However, radio-llava significantly improves upon both base LLaVA and commercial models across nearly all tasks. Despite these gains, specialized vision-only models still deliver substantially better performance across the board. Additionally, we observed that fine-tuning introduces…
Taxonomy
Topics: Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning
