ORBIT: Cost-Effective Dataset Curation for Large Language Model Domain Adaptation with an Astronomy Case Study
Eric Modesitt, Ke Yang, Spencer Hulsey, Chengxiang Zhai, Volodymyr Kindratenko

TL;DR
ORBIT is a cost-effective method for creating high-quality, domain-specific datasets from noisy sources, significantly improving large language model performance in specialized fields like astronomy, law, and medicine.
Contribution
We introduce ORBIT, a novel methodology for efficient domain-specific dataset curation that enhances large language model adaptation with minimal cost and effort.
Findings
Fine-tuning on the curated astronomy data raised the MMLU astronomy score from 69% to 76%.
Top results on the AstroBench benchmark.
The fine-tuned model outperformed the base LLaMA-3-8B on astronomy tasks.
Abstract
Recent advances in language modeling demonstrate the need for high-quality domain-specific training data, especially for tasks that require specialized knowledge. General-purpose models, while versatile, often lack the depth needed for expert-level tasks because of limited domain-specific information. Domain adaptation training can enhance these models, but it demands substantial, high-quality data. To address this, we propose ORBIT, a cost-efficient methodology for curating massive, high-quality domain-specific datasets from noisy web sources, tailored for training specialist large language models. Using astronomy as a primary case study, we refined the 1.3T-token FineWeb-Edu dataset into a high-quality, 10B-token subset focused on astronomy. Fine-tuning LLaMA-3-8B on a 1B-token astronomy subset improved performance on the MMLU astronomy benchmark from 69% to 76% and…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
