Reasoning With a Star: A Heliophysics Dataset and Benchmark for Agentic Scientific Reasoning
Kevin Lee, Russell Spiewak, James Walsh

TL;DR
This paper introduces a heliophysics dataset and benchmark for evaluating scientific reasoning in large language models, emphasizing the importance of structured reasoning, physical assumptions, and format consistency.
Contribution
It provides a new dataset derived from NASA and UCAR problem sets and benchmarks various reasoning approaches, highlighting the effectiveness of workflow decomposition.
Findings
Decomposing reasoning workflows improves model performance.
Multi-agent systems outperform direct prompting in deductive reasoning.
The dataset enables structured evaluation of scientific reasoning.
Abstract
Scientific reasoning with Large Language Models in heliophysics involves more than recalling facts: it requires incorporating physical assumptions, maintaining consistent units, and presenting results in clear scientific formats through coordinated approaches. To address these challenges, we present Reasoning With a Star, a new heliophysics dataset for evaluating reasoning, together with an initial benchmarking approach. Our data are constructed from National Aeronautics and Space Administration (NASA) and University Corporation for Atmospheric Research (UCAR) Living With a Star summer school problem sets and compiled into a readily consumable question-and-answer structure with question contexts, reasoning steps, expected answer type, ground-truth targets, format hints, and metadata. A programmatic grader checks the predictions using unit-aware numerical tolerance, symbolic equivalence, and…
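The record structure and the unit-aware tolerance check described in the abstract can be illustrated with a minimal sketch. This is not the authors' grader: the field names on `QARecord`, the `TO_SI` conversion table, the `grade_numeric` function, the 1% default tolerance, and the example question are all hypothetical, chosen only to show how a prediction given in different units could be compared against a ground-truth target.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical conversion table (unit -> (SI dimension label, factor)).
# A real grader would likely rely on a full units library instead.
TO_SI: Dict[str, Tuple[str, float]] = {
    "m": ("m", 1.0),
    "km": ("m", 1.0e3),
    "s": ("s", 1.0),
    "h": ("s", 3600.0),
    "m/s": ("m/s", 1.0),
    "km/s": ("m/s", 1.0e3),
}


@dataclass
class QARecord:
    """One question-and-answer entry; field names are assumptions."""
    question: str
    context: str
    reasoning_steps: List[str]
    answer_type: str
    target_value: float
    target_unit: str
    format_hint: str = ""
    metadata: Dict[str, str] = field(default_factory=dict)


def to_si(value: float, unit: str) -> Tuple[float, str]:
    """Convert a value to SI and return it with its dimension label."""
    if unit not in TO_SI:
        raise ValueError(f"unknown unit: {unit}")
    dimension, factor = TO_SI[unit]
    return value * factor, dimension


def grade_numeric(record: QARecord, pred_value: float, pred_unit: str,
                  rel_tol: float = 0.01) -> bool:
    """Accept a prediction if its dimension matches the target and the
    SI values agree within a relative tolerance (1% by default)."""
    try:
        pred_si, pred_dim = to_si(pred_value, pred_unit)
    except ValueError:
        return False  # an unrecognized unit counts as a wrong answer
    target_si, target_dim = to_si(record.target_value, record.target_unit)
    if pred_dim != target_dim:
        return False  # e.g. a length given where a speed is expected
    return math.isclose(pred_si, target_si, rel_tol=rel_tol)


# Illustrative record; the question and numbers are made up.
example = QARecord(
    question="What is a typical slow solar wind speed?",
    context="Illustrative context only.",
    reasoning_steps=["Recall the typical order of magnitude."],
    answer_type="numeric",
    target_value=400.0,
    target_unit="km/s",
    format_hint="Give the answer in km/s.",
)

print(grade_numeric(example, 4.0e5, "m/s"))  # True: same speed in m/s
print(grade_numeric(example, 400.0, "m/s"))  # False: off by a factor of 1000
print(grade_numeric(example, 400.0, "km"))   # False: wrong dimension
```

Converting both sides to a common unit before comparing is what makes the check unit-aware: a prediction is not penalized for using m/s instead of km/s, but a bare number match with the wrong unit is rejected.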
Taxonomy
Topics: Scientific Computing and Data Management · Topic Modeling · Multimodal Machine Learning Applications
