ArXiv-to-Model: A Practical Study of Scientific LM Training
Anuj Gupta

TL;DR
This paper presents a case study of training a 1.36B-parameter scientific language model from raw arXiv LaTeX sources, detailing the pipeline, the challenges encountered, and practical lessons for researchers with limited compute.
Contribution
It offers an end-to-end, transparent account of training a scientific language model from raw data, highlighting practical engineering considerations and bottlenecks.
Findings
Preprocessing choices greatly influence usable token volume.
Tokenization affects symbolic stability in training.
Storage and I/O constraints can rival compute as bottlenecks.
Abstract
While frontier large language models demonstrate strong reasoning and mathematical capabilities, the practical process of training domain-specialized scientific language models from raw sources remains under-documented. In this work, we present a detailed case study of training a 1.36B-parameter scientific language model directly from raw arXiv LaTeX sources spanning mathematics, computer science, and theoretical physics. We describe an end-to-end pipeline covering metadata filtering, archive validation, LaTeX extraction, text normalization, domain-aware tokenization, and dense transformer training under constrained compute (2xA100 GPUs). Through 24 experimental runs, we analyze training stability, scaling behavior, data yield losses, and infrastructure bottlenecks. Our findings highlight how preprocessing decisions significantly affect usable token volume, how tokenization impacts…
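To make the text-normalization stage and the notion of "data yield loss" concrete, here is a minimal Python sketch. It is not the paper's implementation, which this summary does not show. The helper names `normalize_latex` and `yield_ratio` are hypothetical, and whitespace-delimited word counts stand in for real tokenizer counts as a rough proxy.

```python
import re

# Hypothetical helpers for illustration; the paper's actual code is not shown here.
_COMMENT = re.compile(r"(?<!\\)%.*")  # an unescaped % up to the end of the line
_BLANKS = re.compile(r"\n{3,}")       # runs of three or more newlines

def normalize_latex(source: str) -> str:
    """Strip LaTeX comments and tidy whitespace, leaving math and macros intact."""
    lines = [_COMMENT.sub("", line).rstrip() for line in source.splitlines()]
    text = "\n".join(lines)
    return _BLANKS.sub("\n\n", text).strip()

def yield_ratio(raw: str, cleaned: str) -> float:
    """Fraction of whitespace-delimited words that survive cleaning.

    Assumption: word counts are only a crude proxy for tokenizer token counts.
    """
    raw_count = len(raw.split())
    return len(cleaned.split()) / raw_count if raw_count else 0.0

sample = "Energy is $E = mc^2$. % TODO: cite\nWe save 50\\% of tokens.\n\n\n\nDone."
cleaned = normalize_latex(sample)
ratio = yield_ratio(sample, cleaned)
```

On this sample the comment is dropped, the escaped `\%` is preserved, and the extra blank lines collapse into one paragraph break. The comment pattern is deliberately simple: it would misread a `%` that follows an escaped backslash (`\\%`), and it does not skip verbatim environments, both of which a production pipeline would likely need to handle.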
Taxonomy
Topics: Mathematics, Computing, and Information Processing · Machine Learning in Materials Science · Scientific Computing and Data Management
