SciLaD: A Large-Scale, Transparent, Reproducible Dataset for Natural Scientific Language Processing
Luca Foppiano, Sotaro Takeshita, Pedro Ortiz Suarez, Ekaterina Borisova, Raia Abu Ahmad, Malte Ostendorff, Fabio Barth, Julian Moreno-Schneider, Georg Rehm

TL;DR
SciLaD is a large, open-source scientific language dataset with over 10 million English and 35 million multilingual publications, enabling reproducible research and pre-training of scientific language models.
Contribution
It introduces a novel, large-scale, openly available scientific dataset and an extensible pipeline, demonstrating high data quality and enabling reproducible scientific NLP research.
Findings
Pre-trained RoBERTa on SciLaD achieves competitive performance.
Dataset and pipeline promote transparency and reproducibility.
Validates the utility of open-source tools for large-scale scientific data curation.
Abstract
SciLaD is a novel, large-scale dataset of scientific language constructed entirely using open-source frameworks and publicly available data sources. It comprises a curated English split containing over 10 million scientific publications and a multilingual, unfiltered TEI XML split including more than 35 million publications. We also publish the extensible pipeline for generating SciLaD. The dataset construction and processing workflow demonstrates how open-source tools can enable large-scale, scientific data curation while maintaining high data quality. Finally, we pre-train a RoBERTa model on our dataset and evaluate it across a comprehensive set of benchmarks, achieving performance comparable to other scientific language models of similar size, validating the quality and utility of SciLaD. We publish the dataset and evaluation pipeline to promote reproducibility, transparency, and…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsResearch Data Management Practices · Scientific Computing and Data Management · Digital Humanities and Scholarship
