AstroMLab 4: Benchmark-Topping Performance in Astronomy Q&A with a 70B-Parameter Domain-Specialized Reasoning Model

Tijmen de Haan; Yuan-Sen Ting; Tirthankar Ghosal; Tuan Dung Nguyen; Alberto Accomazzi; Emily Herron; Vanessa Lama; Rui Pan; Azton Wells; Nesar Ramachandra

arXiv:2505.17592 · astro-ph.IM · February 23, 2026

TL;DR

AstroSage-Llama-3.1-70B, a domain-specialized large language model for astronomy, achieves top performance on research questions, surpassing general models and demonstrating the value of domain adaptation in AI.

Contribution

Introduces AstroSage-Llama-3.1-70B, a 70-billion-parameter astronomy-focused LLM built through continued pre-training, supervised fine-tuning with reasoning chains, and model merging, which outperforms generalist models on astronomy tasks.

Findings

1. Achieves 89.0% accuracy on the AstroMLab-1 benchmark.

2. Matches the performance of GPT-5.2 and Claude-4.5-Opus.

3. Is more cost-efficient than comparable models.

Abstract

General-purpose large language models (LLMs), despite their broad capabilities, often struggle with specialized domain knowledge. This gap hinders their deployment as reliable research agents in demanding fields such as astronomy. Building on our prior work with AstroSage-Llama-3.1-8B, this study introduces AstroSage-Llama-3.1-70B, a 70-billion parameter domain-specialized natural-language AI assistant. It is designed for research and education across astronomy, astrophysics, space science, astroparticle physics, cosmology, and astronomical instrumentation. Developed from the Meta-Llama-3.1-70B foundation, AstroSage-Llama-3.1-70B underwent extensive continued pre-training (CPT) on a vast corpus of astronomical literature, followed by supervised fine-tuning (SFT) and model merging. We integrated reasoning chains into the SFT dataset, enabling AstroSage-Llama-3.1-70B to either answer the…

Taxonomy

Methods: Shrink and Fine-Tune