ORBIT: Cost-Effective Dataset Curation for Large Language Model Domain Adaptation with an Astronomy Case Study

Eric Modesitt; Ke Yang; Spencer Hulsey; Chengxiang Zhai; Volodymyr Kindratenko

arXiv:2412.14436·cs.CL·December 20, 2024

Open Access · 1 Repo · 1 Video

TL;DR

ORBIT is a cost-effective method for creating high-quality, domain-specific datasets from noisy sources, significantly improving large language model performance in specialized fields like astronomy, law, and medicine.

Contribution

We introduce ORBIT, a novel methodology for efficient domain-specific dataset curation that enhances large language model adaptation with minimal cost and effort.

Findings

01

Improved astronomy benchmark scores from 69% to 76%.

Abstract

Recent advances in language modeling demonstrate the need for high-quality domain-specific training data, especially for tasks that require specialized knowledge. General-purpose models, while versatile, often lack the depth needed for expert-level tasks because of limited domain-specific information. Domain adaptation training can enhance these models, but it demands substantial, high-quality data. To address this, we propose ORBIT, a cost-efficient methodology for curating massive, high-quality domain-specific datasets from noisy web sources, tailored for training specialist large language models. Using astronomy as a primary case study, we refined the 1.3T-token FineWeb-Edu dataset into a high-quality, 10B-token subset focused on astronomy. Fine-tuning LLaMA-3-8B on a 1B-token astronomy subset improved performance on the MMLU astronomy benchmark from 69% to 76% and…
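
To put the abstract's figures in proportion, the short sketch below works through the arithmetic: how small the astronomy subset is relative to FineWeb-Edu, and how large the reported benchmark gain is in absolute and relative terms. The function names are hypothetical helpers introduced only for this illustration; the numbers are taken from the abstract, and nothing here reflects the authors' actual filtering code.

```python
# Illustrative arithmetic only. Function names are hypothetical;
# the token counts and scores are the ones quoted in the abstract.

def retention_fraction(kept_tokens: float, source_tokens: float) -> float:
    """Fraction of the source corpus that survives curation."""
    return kept_tokens / source_tokens


def accuracy_gain(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain as a fraction)."""
    absolute = after_pct - before_pct
    relative = absolute / before_pct
    return absolute, relative


FINEWEB_EDU_TOKENS = 1.3e12   # 1.3T-token source corpus
ASTRONOMY_TOKENS = 10e9       # 10B-token curated astronomy subset
FINETUNE_TOKENS = 1e9         # 1B-token subset used for fine-tuning

if __name__ == "__main__":
    kept = retention_fraction(ASTRONOMY_TOKENS, FINEWEB_EDU_TOKENS)
    used = retention_fraction(FINETUNE_TOKENS, ASTRONOMY_TOKENS)
    points, relative = accuracy_gain(69, 76)

    print(f"Curated subset as share of FineWeb-Edu: {kept:.2%}")
    print(f"Fine-tuning subset as share of curated data: {used:.0%}")
    print(f"MMLU astronomy gain: {points} points ({relative:.1%} relative)")
```

Under these figures the curated subset is roughly 0.77% of the source corpus, the fine-tuning run uses a tenth of that subset, and the 7-point benchmark gain is about a 10% relative improvement over the 69% baseline.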

Code & Models

Repositories

modeeric/orbit-llama
PyTorch · Official

Videos

ORBIT: Cost-Effective Dataset Curation for Large Language Model Domain Adaptation with an Astronomy Case Study· underline

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques