AstroAlertBench: Evaluating the Accuracy, Reasoning, and Honesty of Multimodal LLMs in Astronomical Classification

Claire Chen; Jiabao Sean Xiao; Shuze Daniel Liu; Facundo Perez Paolino; Luke Handley; Theophile Jegou du Laz; Ricky Nilsson; Alice Zou; Matthew Graham; Ashish Mahabal

arXiv:2605.05573 · astro-ph.IM · May 8, 2026

TL;DR

AstroAlertBench is a new benchmark for evaluating multimodal LLMs in astronomical classification, focusing on accuracy, reasoning, and honesty across a three-stage logical process using real-world alert data.

Contribution

It introduces a comprehensive benchmark and evaluation protocol for assessing the performance and interpretability of multimodal LLMs in astronomical event review.

Findings

01

High-accuracy models often lack honesty in reasoning.

02

The benchmark reveals gaps in models' self-evaluation capabilities.

03

Framework supports development of calibrated, interpretable astronomical assistants.

Abstract

Modern astronomical observatories generate a massive volume of multimodal data, creating a critical bottleneck for expert human review. While multimodal large language models (LLMs) have shown promise in interpreting complex visual and textual inputs, their ability to perform specialized scientific classification while providing interpretable reasoning remains understudied. We introduce AstroAlertBench, a comprehensive multimodal benchmark designed to evaluate LLM performance in astronomical event review along a three-stage logical chain: metadata grounding, scientific reasoning, and hierarchical classification over five categories. We use a pilot sample of 1,500 real-world alerts from the Zwicky Transient Facility (ZTF), a wide-field survey that scans the northern sky to detect transient astronomical events. On this dataset, we benchmark 13 frontier closed-source and open-weight LLMs…
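
The three-stage chain named in the abstract (metadata grounding, scientific reasoning, hierarchical classification) suggests scoring each stage separately as well as the chain as a whole. The abstract does not say how the stages are graded or aggregated, so the following is only a minimal sketch under assumptions: the `AlertResult` record, its field names, the alert IDs, and the "all three stages correct" chain metric are hypothetical illustrations, not the paper's protocol.

```python
from dataclasses import dataclass

# Stage names follow the abstract; using them as field names is an assumption.
STAGES = ("metadata_grounding", "scientific_reasoning", "classification")


@dataclass
class AlertResult:
    """Hypothetical per-alert grading record (not the paper's schema)."""
    alert_id: str
    metadata_grounding: bool
    scientific_reasoning: bool
    classification: bool


def stage_accuracy(results):
    """Return the fraction of alerts graded correct at each stage."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to score")
    return {s: sum(getattr(r, s) for r in results) / n for s in STAGES}


def chain_accuracy(results):
    """Return the fraction of alerts graded correct at all three stages."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to score")
    return sum(all(getattr(r, s) for s in STAGES) for r in results) / n


# Made-up grades for four placeholder alerts.
results = [
    AlertResult("alert-a", True, True, True),
    AlertResult("alert-b", True, False, True),
    AlertResult("alert-c", False, False, False),
    AlertResult("alert-d", True, True, False),
]
per_stage = stage_accuracy(results)
full_chain = chain_accuracy(results)
```

Reporting the chain metric next to the per-stage numbers makes it visible when a model reaches the right class without a correct reasoning step, as with `alert-b` in the made-up data.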
