AstroMLab 1: Who Wins Astronomy Jeopardy!?
Yuan-Sen Ting, Tuan Dung Nguyen, Tirthankar Ghosal, Rui Pan, Hardik Arora, Zechang Sun, Tijmen de Haan, Nesar Ramachandra, Azton Wells, Sandeep Madireddy, Alberto Accomazzi

TL;DR
This paper evaluates large language models on a new astronomy-specific benchmark, revealing rapid improvements, performance variations across topics, and potential for affordable deployment in astronomical research.
Contribution
It introduces a comprehensive astronomy question dataset and benchmarks multiple models, highlighting performance trends, calibration quality, and implications for research deployment.
Findings
Claude-3.5-Sonnet achieves 85% accuracy, outperforming competitors.
Open-weights models like LLaMA-3-70b now rival proprietary models.
Models show regional and topic-based performance variations.
Abstract
We present a comprehensive evaluation of proprietary and open-weights large language models using the first astronomy-specific benchmarking dataset. This dataset comprises 4,425 multiple-choice questions curated from the Annual Review of Astronomy and Astrophysics, covering a broad range of astrophysical topics. Our analysis examines model performance across various astronomical subfields and assesses response calibration, crucial for potential deployment in research environments. Claude-3.5-Sonnet outperforms competitors by up to 4.6 percentage points, achieving 85.0% accuracy. For proprietary models, we observed a universal reduction in cost every 3-to-12 months to achieve a similar score on this particular astronomy benchmark. Open-weights models have rapidly improved, with LLaMA-3-70b (80.6%) and Qwen-2-72b (77.7%) now competing with some of the best proprietary models. We identify…
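The two quantities the abstract reports, multiple-choice accuracy and response calibration, can be illustrated with a minimal sketch. It is not the paper's evaluation code: the function names, the binned expected calibration error (ECE) as the calibration measure, and the toy inputs are all assumptions made for illustration.

```python
from typing import List, Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of multiple-choice questions answered correctly."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and equal in length")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Binned ECE: weighted mean gap between stated confidence and observed accuracy.

    Assumption: each confidence is the model's probability (0 to 1) for its chosen option.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("confidences and correct must be non-empty and equal in length")
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, conf in enumerate(confidences):
        # A confidence of exactly 1.0 is placed in the last bin.
        bins[min(int(conf * n_bins), n_bins - 1)].append(i)
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(confidences[i] for i in members) / len(members)
        frac_correct = sum(1 for i in members if correct[i]) / len(members)
        ece += len(members) / total * abs(mean_conf - frac_correct)
    return ece


if __name__ == "__main__":
    # Toy data (hypothetical, not taken from the benchmark).
    preds = ["A", "B", "C", "D"]
    gold = ["A", "B", "C", "A"]
    confs = [0.9, 0.9, 0.6, 0.6]
    hits = [p == g for p, g in zip(preds, gold)]
    print(f"accuracy = {accuracy(preds, gold):.3f}")
    print(f"ECE      = {expected_calibration_error(confs, hits):.3f}")
```

A well-calibrated model has an ECE near zero: among answers given with about 80% confidence, roughly 80% are correct. This is the property the abstract flags as important before deploying a model in a research setting.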
Code & Models
- 🤗 AstroMLab/astrollama-2-7b-base_abstract (model) · 10 dl
- 🤗 AstroMLab/astrollama-2-70b-base_aic (model) · 13 dl · ♡ 2
- 🤗 AstroMLab/astrollama-2-7b-chat_aic (model) · 6 dl · ♡ 1
- 🤗 AstroMLab/astrollama-3-8b-base_aic (model) · 6 dl · ♡ 1
- 🤗 AstroMLab/astrollama-3-8b-chat_aic (model) · 6 dl
- 🤗 AstroMLab/astrollama-3-8b-base_summary (model) · 10 dl
- 🤗 AstroMLab/astrollama-3-8b-chat_summary (model) · 6 dl · ♡ 1
- 🤗 AstroMLab/astrollama-2-7b-base_aic (model) · 1 dl
- 🤗 RichardErkhov/AstroMLab_-_astrollama-3-8b-chat_summary-gguf (model) · 22 dl
- 🤗 RichardErkhov/AstroMLab_-_astrollama-3-8b-base_aic-gguf (model) · 110 dl · ♡ 1
Taxonomy
Topics: History and Developments in Astronomy · Space exploration and regulation · Economic Growth and Productivity
