A Multi-Survey Machine-Readable Corpus of Milky Way Globular Cluster Parameters for Retrieval-Augmented Generation Applications
David C. Flynn

TL;DR
This paper introduces a machine-readable database of 174 Milky Way globular clusters that combines data from four published surveys, designed for use in both AI applications and traditional astrophysical analyses.
Contribution
It provides a unified, well-documented corpus integrating diverse survey data for globular clusters, optimized for retrieval-augmented generation and astrophysical research.
Findings
Validated for structured context injection with language models
Contains 17,438 non-null data points across 174 clusters, drawn from four surveys
Supports traditional analyses like orbit modeling and chemical tagging
Abstract
We present the Milky Way Globular Cluster Corpus v1.3.1, a unified machine-readable database of fundamental parameters for 174 Milky Way globular clusters assembled from four independent published surveys. Each cluster record integrates photometric, structural, and spectroscopically calibrated metallicity parameters from Harris (1996, 2010 revision), Gaia EDR3 proper motions from Vasiliev & Baumgardt (2021), N-body dynamical masses and orbital parameters from Baumgardt et al. (2023), and mean chemical abundances from the APOGEE DR17 globular cluster Value Added Catalog of Schiavon et al. (2024). The corpus contains 17,438 non-null data points across 174 clusters, stored in JSONL, JSON, and flat CSV formats with consistent native-typed fields (float, int, bool, null), embedded provenance blocks, and a fully documented schema. Survey coverage is 157/174 clusters for Harris photometry,…
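As a rough illustration of what "non-null data points" in a native-typed JSONL file could mean in practice, the sketch below parses JSONL records and counts every leaf value that is not null. The field names (`cluster_id`, `harris`, `apogee`, `feh`, `core_collapsed`, `n_members`), the placeholder identifiers, and the sample values are all hypothetical and are not taken from the corpus schema; whether the corpus counts its data points this way (for example, whether identifiers or provenance blocks are included) is also an assumption.

```python
import json


def count_non_null(value):
    """Count non-null leaf values in a nested JSON structure.

    Booleans count as data points (False is a value, not a missing one);
    only JSON null (Python None) is treated as missing.
    """
    if value is None:
        return 0
    if isinstance(value, dict):
        return sum(count_non_null(v) for v in value.values())
    if isinstance(value, list):
        return sum(count_non_null(v) for v in value)
    return 1


def load_jsonl(text):
    """Parse JSONL text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical records with made-up field names and placeholder values.
SAMPLE = "\n".join([
    json.dumps({"cluster_id": "Example A",
                "harris": {"feh": -1.0, "core_collapsed": False},
                "apogee": None}),
    json.dumps({"cluster_id": "Example B",
                "harris": {"feh": -2.0, "core_collapsed": True},
                "apogee": {"n_members": None}}),
])

records = load_jsonl(SAMPLE)
total = sum(count_non_null(r) for r in records)
print(f"{len(records)} records, {total} non-null data points")
```

Keeping missing survey coverage as an explicit null, rather than dropping the key, lets a reader distinguish "not observed by this survey" from "field does not exist", which matters when per-survey coverage differs from cluster to cluster.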
