A systematic evaluation of vision-language models for observational astronomical reasoning tasks
Wenke Ren, Hengxiao Guo, Wenwen Zuo, Xiaoman Zhang

TL;DR
This paper systematically evaluates vision-language models for diverse astronomical data interpretation tasks, highlighting modality-dependent performance and the importance of physical grounding for trustworthy scientific reasoning.
Contribution
It introduces AstroVLBench, a comprehensive benchmark for VLMs in astronomy, and analyzes how physical knowledge and prompt design affect model accuracy and reliability.
Findings
Performance is strongly modality-dependent across the six frontier models evaluated.
Physical grounding improves model accuracy and reduces bias.
Numerical data presentation enhances reasoning performance.
Abstract
Vision-language models (VLMs) are increasingly proposed as general-purpose tools for scientific data interpretation, yet their reliability on real astronomical observations across diverse modalities remains untested. We present AstroVLBench, a comprehensive benchmark comprising over 4,100 expert-verified instances across five tasks spanning optical imaging, radio interferometry, multi-wavelength photometry, time-domain light curves, and optical spectroscopy. Evaluating six frontier models, we find that performance is strongly modality-dependent: while one model (Gemini 3 Pro) emerges as the most consistently capable across tasks, task-specific strengths vary, and all models substantially underperform domain-specialized methods. Mechanistic ablations reveal that performance depends not only on directing attention to salient visual features but also on grounding those features in physical…
