MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning
Run-Ze Fan, Zengzhi Wang, Pengfei Liu

TL;DR
MegaScience introduces a large-scale, high-quality scientific reasoning dataset and evaluation system, significantly improving model performance and training efficiency for scientific AI across multiple disciplines.
Contribution
The paper presents MegaScience, a comprehensive scientific reasoning dataset and evaluation framework, enabling better training and assessment of AI models in scientific domains.
Findings
Models trained on MegaScience outperform models trained on existing datasets.
MegaScience's concise responses make training more efficient.
Larger models benefit more from MegaScience data.
Abstract
Scientific reasoning is critical for developing AI scientists and supporting human researchers in advancing the frontiers of natural science discovery. However, the open-source community has primarily focused on mathematics and coding while neglecting the scientific domain, largely due to the absence of open, large-scale, high-quality, verifiable scientific reasoning datasets. To bridge this gap, we first present TextbookReasoning, an open dataset featuring truthful reference answers extracted from 12k university-level scientific textbooks, comprising 650k reasoning questions spanning 7 scientific disciplines. We further introduce MegaScience, a large-scale mixture of high-quality open-source datasets totaling 1.25 million instances, developed through systematic ablation studies that evaluate various data selection methodologies to identify the optimal subset for each publicly available…
Code & Models
- 🤗 MegaScience/Llama3.1-8B-MegaScience · model · 8 dl · ♡ 1
- 🤗 MegaScience/Qwen2.5-1.5B-MegaScience · model · 22 dl · ♡ 2
- 🤗 MegaScience/Qwen2.5-3B-MegaScience · model · 17 dl
- 🤗 MegaScience/Qwen2.5-7B-MegaScience · model · 3 dl
- 🤗 MegaScience/Qwen3-1.7B-MegaScience · model · 7 dl · ♡ 3
- 🤗 MegaScience/Qwen3-4B-MegaScience · model · 31 dl · ♡ 5
- 🤗 MegaScience/Qwen3-8B-MegaScience · model · 8 dl · ♡ 2
- 🤗 MegaScience/Qwen3-14B-MegaScience · model · 16 dl · ♡ 6
- 🤗 MegaScience/Qwen3-30B-A3B-MegaScience · model · 593 dl · ♡ 9
- 🤗 ryanfortin/community-blend-qwen3-8b · model · 7 dl
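As a minimal sketch of how one of the checkpoints listed above might be loaded, the snippet below uses the Hugging Face `transformers` auto classes. The repository IDs are copied from the list; the helper names `pick_model` and `load_model` are hypothetical, and it is an assumption (not confirmed by this page) that every checkpoint loads as a standard causal language model with a recent `transformers` release.

```python
# Repository IDs copied from the "Code & Models" list (official MegaScience models only).
MEGASCIENCE_MODELS = [
    "MegaScience/Llama3.1-8B-MegaScience",
    "MegaScience/Qwen2.5-1.5B-MegaScience",
    "MegaScience/Qwen2.5-3B-MegaScience",
    "MegaScience/Qwen2.5-7B-MegaScience",
    "MegaScience/Qwen3-1.7B-MegaScience",
    "MegaScience/Qwen3-4B-MegaScience",
    "MegaScience/Qwen3-8B-MegaScience",
    "MegaScience/Qwen3-14B-MegaScience",
    "MegaScience/Qwen3-30B-A3B-MegaScience",
]


def pick_model(keyword: str) -> str:
    """Return the first listed repo ID containing `keyword` (case-insensitive).

    Hypothetical helper for illustration; it only searches the list above.
    """
    needle = keyword.lower()
    for repo_id in MEGASCIENCE_MODELS:
        if needle in repo_id.lower():
            return repo_id
    raise ValueError(f"no listed model matches {keyword!r}")


def load_model(repo_id: str):
    """Download and load a tokenizer and model from the Hugging Face Hub.

    Assumes the checkpoint is a causal LM. Requires `transformers`, a
    backend such as PyTorch, and network access, so the import is lazy.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    return tokenizer, model
```

For example, `load_model(pick_model("qwen3-4b"))` would fetch the Qwen3-4B checkpoint; the larger models need correspondingly more memory.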
Taxonomy
Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Machine Learning in Materials Science
