ArXiv-to-Model: A Practical Study of Scientific LM Training

Anuj Gupta

arXiv:2602.17288·cs.AI·February 20, 2026

Open Access · 1 Model · 2 Datasets

TL;DR

This paper presents a case study of training a 1.36B-parameter scientific language model from raw arXiv LaTeX sources, detailing the pipeline, the challenges encountered, and insights for researchers with limited compute resources.

Contribution

It offers an end-to-end, transparent account of training a scientific language model from raw data, highlighting practical engineering considerations and bottlenecks.

Findings

01

Preprocessing choices greatly influence usable token volume.

02

Tokenization affects symbolic stability in training.

03

Storage and I/O constraints can rival compute as bottlenecks.

Abstract

While frontier large language models demonstrate strong reasoning and mathematical capabilities, the practical process of training domain-specialized scientific language models from raw sources remains under-documented. In this work, we present a detailed case study of training a 1.36B-parameter scientific language model directly from raw arXiv LaTeX sources spanning mathematics, computer science, and theoretical physics. We describe an end-to-end pipeline covering metadata filtering, archive validation, LaTeX extraction, text normalization, domain-aware tokenization, and dense transformer training under constrained compute (2xA100 GPUs). Through 24 experimental runs, we analyze training stability, scaling behavior, data yield losses, and infrastructure bottlenecks. Our findings highlight how preprocessing decisions significantly affect usable token volume, how tokenization impacts…
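The archive validation, LaTeX extraction, and text normalization stages named in the abstract can be illustrated with a minimal Python sketch. This is not the author's pipeline: the function names (`is_valid_archive`, `extract_tex`, `normalize`), the assumption that a source is a tar archive (optionally gzip-compressed) holding `.tex` files, and the specific normalization rules are all hypothetical choices made for illustration, using only the standard library.

```python
import io
import re
import tarfile

# Hypothetical rule: a '%' not preceded by a backslash starts a LaTeX comment.
# Simplification: this mishandles the rare '\\%' (line break, then comment).
COMMENT_RE = re.compile(r"(?<!\\)%.*")


def is_valid_archive(data: bytes) -> bool:
    """Return True if the bytes open as a tar archive with at least one .tex file."""
    try:
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:*") as tar:
            return any(m.isfile() and m.name.endswith(".tex") for m in tar.getmembers())
    except (tarfile.TarError, EOFError, OSError):
        # Corrupt, truncated, or non-tar payloads are rejected rather than repaired.
        return False


def extract_tex(data: bytes) -> str:
    """Concatenate the contents of every .tex member in the archive."""
    parts = []
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:*") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(".tex"):
                handle = tar.extractfile(member)
                if handle is not None:
                    # Undecodable bytes are replaced instead of aborting the document.
                    parts.append(handle.read().decode("utf-8", errors="replace"))
    return "\n".join(parts)


def normalize(tex: str) -> str:
    """Strip comments and collapse whitespace while leaving math markup untouched."""
    tex = COMMENT_RE.sub("", tex)          # drop comments, keep escaped \%
    tex = re.sub(r"[ \t]+", " ", tex)      # collapse runs of spaces and tabs
    tex = re.sub(r" +\n", "\n", tex)       # remove trailing spaces on each line
    tex = re.sub(r"\n{3,}", "\n\n", tex)   # keep at most one blank line in a row
    return tex.strip()
```

Each stage in a sketch like this discards something (invalid archives, non-`.tex` members, comments, redundant whitespace), so comparing byte or token counts before and after each function is one simple way to measure the kind of data yield loss the abstract refers to.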

Code & Models

Models

🤗
KiteFishAI/Minnow-Math-1.5B
model · 171k downloads · ♡ 1

Datasets

Taxonomy

Topics: Mathematics, Computing, and Information Processing · Machine Learning in Materials Science · Scientific Computing and Data Management