A Multi-lingual Dataset of Classified Paragraphs from Open Access Scientific Publications
Eric Jeangirard

TL;DR
This paper introduces a large, multilingual dataset of scientific paragraphs with detailed classifications, supporting advancements in scientific text mining and language-specific NLP tasks.
Contribution
It provides a publicly available, richly annotated dataset of 833k paragraphs in multiple languages, facilitating research in scientific literature classification and entity recognition.
Findings
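The dataset is primarily in English and French, with additional European languages represented.
Each paragraph is annotated with its language (identified with fastText) and its scientific domain (from OpenAlex).
The dataset enables training of text classification models and development of named entity recognition (NER) systems for scientific literature mining.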
Abstract
We present a dataset of 833k paragraphs extracted from CC-BY licensed scientific publications, classified into four categories: acknowledgments, data mentions, software/code mentions, and clinical trial mentions. The paragraphs are primarily in English and French, with additional European languages represented. Each paragraph is annotated with language identification (using fastText) and scientific domain (from OpenAlex). This dataset, derived from the French Open Science Monitor corpus and processed using GROBID, enables training of text classification models and development of named entity recognition systems for scientific literature mining. The dataset is publicly available on HuggingFace (https://doi.org/10.57967/hf/6679) under a CC-BY license.
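As an illustration of how the annotations described above might be used, the following is a minimal sketch that filters paragraphs by category and language and tallies the combinations. The field names (`text`, `label`, `lang`, `domain`), the label strings, and the sample records are all hypothetical, invented for this example; the published dataset's actual schema may differ and should be checked on its HuggingFace page.

```python
from collections import Counter

# The four paragraph categories named in the abstract.
# These label strings are assumptions, not the dataset's actual values.
CATEGORIES = ("acknowledgment", "data", "software", "clinical_trial")

# Toy records with a hypothetical schema; the texts are invented for illustration.
SAMPLE = [
    {"text": "We thank the reviewers for their comments.",
     "label": "acknowledgment", "lang": "en", "domain": "Social Sciences"},
    {"text": "Les analyses ont été réalisées avec R.",
     "label": "software", "lang": "fr", "domain": "Health Sciences"},
    {"text": "All analyses were run in Python.",
     "label": "software", "lang": "en", "domain": "Physical Sciences"},
    {"text": "Data are available from the corresponding author.",
     "label": "data", "lang": "en", "domain": "Life Sciences"},
]


def select(records, label, lang):
    """Return the texts of records matching one category and one language code."""
    if label not in CATEGORIES:
        raise ValueError(f"unknown category: {label!r}")
    return [r["text"] for r in records if r["label"] == label and r["lang"] == lang]


def tally(records):
    """Count records per (category, language) pair."""
    return Counter((r["label"], r["lang"]) for r in records)


if __name__ == "__main__":
    print(select(SAMPLE, "software", "fr"))
    print(tally(SAMPLE))
```

A subset selected this way (for example, French software mentions) could serve as training data for a language-specific classifier, which is the kind of use the abstract anticipates.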
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Topic Modeling · Text and Document Classification Technologies
