AstroMLab 1: Who Wins Astronomy Jeopardy!?

Yuan-Sen Ting; Tuan Dung Nguyen; Tirthankar Ghosal; Rui Pan; Hardik Arora; Zechang Sun; Tijmen de Haan; Nesar Ramachandra; Azton Wells; Sandeep Madireddy; Alberto Accomazzi

arXiv:2407.11194·astro-ph.IM·November 12, 2024·1 cites



TL;DR

This paper evaluates large language models on a new astronomy-specific benchmark, revealing rapid improvements, performance variations across topics, and potential for affordable deployment in astronomical research.

Contribution

It introduces a comprehensive astronomy question dataset and benchmarks multiple models, highlighting performance trends, calibration quality, and implications for research deployment.

Findings

01

Claude-3.5-Sonnet achieves 85.0% accuracy, outperforming competitors by up to 4.6 percentage points.

02

Open-weights models such as LLaMA-3-70b (80.6%) and Qwen-2-72b (77.7%) now compete with some of the best proprietary models.

03

Models show regional and topic-based performance variations.

Abstract

We present a comprehensive evaluation of proprietary and open-weights large language models using the first astronomy-specific benchmarking dataset. This dataset comprises 4,425 multiple-choice questions curated from the Annual Review of Astronomy and Astrophysics, covering a broad range of astrophysical topics. Our analysis examines model performance across various astronomical subfields and assesses response calibration, which is crucial for potential deployment in research environments. Claude-3.5-Sonnet outperforms competitors by up to 4.6 percentage points, achieving 85.0% accuracy. For proprietary models, we observed a universal reduction in cost every 3 to 12 months to achieve a similar score on this particular astronomy benchmark. Open-weights models have rapidly improved, with LLaMA-3-70b (80.6%) and Qwen-2-72b (77.7%) now competing with some of the best proprietary models. We identify…
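The two quantities the abstract reports, accuracy on multiple-choice questions and response calibration, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the `Answer` record, the function names, and the use of expected calibration error (ECE) with equal-width confidence bins. ECE is a common calibration metric and is not necessarily the one the authors use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Answer:
    """One graded multiple-choice response (hypothetical record layout)."""
    predicted: str     # option the model chose, e.g. "A"
    correct: str       # ground-truth option
    confidence: float  # model's stated probability of being right, in [0, 1]


def accuracy(answers: List[Answer]) -> float:
    """Fraction of responses whose chosen option matches the key."""
    return sum(a.predicted == a.correct for a in answers) / len(answers)


def expected_calibration_error(answers: List[Answer], n_bins: int = 10) -> float:
    """ECE: weighted mean gap between confidence and accuracy per bin.

    Assumption: equal-width confidence bins; a confidence of exactly 1.0
    is placed in the last bin.
    """
    bins: List[List[Answer]] = [[] for _ in range(n_bins)]
    for a in answers:
        idx = min(int(a.confidence * n_bins), n_bins - 1)
        bins[idx].append(a)

    ece = 0.0
    for group in bins:
        if not group:
            continue  # empty bins contribute nothing
        mean_conf = sum(a.confidence for a in group) / len(group)
        gap = abs(accuracy(group) - mean_conf)
        ece += (len(group) / len(answers)) * gap
    return ece
```

A well-calibrated model has an ECE near zero: when it reports 80% confidence, it is right about 80% of the time. Accuracy alone does not capture this, which is why the two are reported separately.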


Taxonomy

Topics: History and Developments in Astronomy · Space exploration and regulation · Economic Growth and Productivity