Reasoning With a Star: A Heliophysics Dataset and Benchmark for Agentic Scientific Reasoning

Kevin Lee; Russell Spiewak; James Walsh

arXiv:2511.20694 · cs.AI · February 10, 2026

Open Access

TL;DR

This paper introduces a heliophysics dataset and benchmark for evaluating scientific reasoning in large language models, emphasizing the importance of structured reasoning, physical assumptions, and format consistency.

Contribution

The paper provides a new dataset derived from NASA and UCAR Living With a Star summer school problem sets and benchmarks several reasoning approaches, highlighting the effectiveness of workflow decomposition.

Findings

01. Decomposing reasoning workflows improves model performance.

02. Multi-agent systems outperform direct prompting in deductive reasoning.

03. The dataset enables structured evaluation of scientific reasoning.

Abstract

Scientific reasoning through Large Language Models in heliophysics involves more than just recalling facts: it requires incorporating physical assumptions, maintaining consistent units, and providing clear scientific formats through coordinated approaches. To address these challenges, we present Reasoning With a Star, a newly contributed heliophysics dataset applicable to reasoning; we also provide an initial benchmarking approach. Our data are constructed from National Aeronautics and Space Administration & University Corporation for Atmospheric Research Living With a Star summer school problem sets and compiled into a readily consumable question-and-answer structure with question contexts, reasoning steps, expected answer type, ground-truth targets, format hints, and metadata. A programmatic grader checks the predictions using unit-aware numerical tolerance, symbolic equivalence, and…
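The record structure and unit-aware numerical check described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `QARecord` fields, the `TO_SI` and `DIMENSION` tables, the `grade_numeric` function, and the 5% relative tolerance are hypothetical and are not taken from the paper's grader. The sample question is likewise invented, not drawn from the dataset.

```python
import math
from dataclasses import dataclass, field

# Hypothetical conversion factors to SI (illustrative subset only).
TO_SI = {"m": 1.0, "km": 1.0e3, "cm": 1.0e-2, "s": 1.0, "min": 60.0, "hr": 3600.0}

# Hypothetical dimension tag per unit, used to reject mismatched quantities.
DIMENSION = {
    "m": "length", "km": "length", "cm": "length",
    "s": "time", "min": "time", "hr": "time",
}


@dataclass
class QARecord:
    """Assumed shape of one question-and-answer entry (field names are guesses)."""
    question: str
    context: str
    reasoning_steps: list
    answer_type: str          # e.g. "numeric"
    target_value: float       # ground-truth magnitude
    target_unit: str          # ground-truth unit
    format_hint: str = ""
    metadata: dict = field(default_factory=dict)


def grade_numeric(pred_value, pred_unit, record, rel_tol=0.05):
    """Return True if the prediction matches the target after unit conversion.

    Unknown units or a dimension mismatch (e.g. km vs s) count as incorrect.
    The relative tolerance is an assumed default, not the paper's setting.
    """
    if pred_unit not in TO_SI or record.target_unit not in TO_SI:
        return False
    if DIMENSION[pred_unit] != DIMENSION[record.target_unit]:
        return False
    pred_si = pred_value * TO_SI[pred_unit]
    target_si = record.target_value * TO_SI[record.target_unit]
    return math.isclose(pred_si, target_si, rel_tol=rel_tol)


# Invented example: light travel time over 1 AU is roughly 499 s.
record = QARecord(
    question="How long does sunlight take to reach Earth?",
    context="Use 1 AU = 1.496e11 m and c = 2.998e8 m/s.",
    reasoning_steps=["t = d / c", "t = 1.496e11 m / 2.998e8 m/s"],
    answer_type="numeric",
    target_value=499.0,
    target_unit="s",
    format_hint="Give the answer with units.",
)

print(grade_numeric(8.3, "min", record))   # same quantity in other units
print(grade_numeric(499.0, "km", record))  # right number, wrong dimension
print(grade_numeric(600.0, "s", record))   # right unit, outside tolerance
```

Converting both values to a common unit before comparing means an answer of 8.3 min is accepted against a 499 s target, while a numerically identical answer in the wrong dimension is rejected. A production grader would more likely rely on a full units library than a hand-written table.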

Taxonomy

Topics: Scientific Computing and Data Management · Topic Modeling · Multimodal Machine Learning Applications