A Multi-lingual Dataset of Classified Paragraphs from Open Access Scientific Publications

Eric Jeangirard

arXiv:2510.21762·cs.CL·October 28, 2025

TL;DR

This paper introduces a multilingual dataset of 833k paragraphs from open access scientific publications, each classified as an acknowledgment, data mention, software/code mention, or clinical trial mention, to support scientific text mining.

Contribution

It releases a publicly available dataset of 833k paragraphs from CC-BY licensed scientific publications, each labeled with one of four categories plus language and scientific domain, for training text classifiers and developing named entity recognition systems.

Findings

01

The dataset is primarily in English and French, with additional European languages represented.

02

Each paragraph is annotated with its language (identified with fastText) and its scientific domain (from OpenAlex).

03

Enables training of classification and NER models.
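
To make the classification task concrete, here is a minimal keyword-cue baseline in Python. It is only a sketch and is not the method used in the paper: the cue words, the label strings, and the function names `tokenize` and `classify` are all assumptions made for illustration. A trained model would normally replace this lookup.

```python
import re

# Hypothetical cue words per category. These lists are illustrative
# assumptions and are not taken from the paper or the dataset.
CUES = {
    "acknowledgment": {"thank", "grateful", "funded", "remercions"},
    "data": {"dataset", "data", "repository", "données"},
    "software": {"software", "code", "package", "github"},
    "clinical_trial": {"trial", "randomized", "registered", "essai"},
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def classify(text):
    """Return the category with the most cue-word hits, or None if no cue matches.

    Ties are resolved in favour of the category listed first in CUES.
    """
    tokens = tokenize(text)
    scores = {label: sum(t in cues for t in tokens) for label, cues in CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

A baseline like this mainly serves as a sanity check against which a classifier trained on the labeled paragraphs can be compared.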

Abstract

We present a dataset of 833k paragraphs extracted from CC-BY licensed scientific publications, classified into four categories: acknowledgments, data mentions, software/code mentions, and clinical trial mentions. The paragraphs are primarily in English and French, with additional European languages represented. Each paragraph is annotated with language identification (using fastText) and scientific domain (from OpenAlex). This dataset, derived from the French Open Science Monitor corpus and processed using GROBID, enables training of text classification models and development of named entity recognition systems for scientific literature mining. The dataset is publicly available on HuggingFace https://doi.org/10.57967/hf/6679 under a CC-BY license.
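
As an illustration of how the annotations described in the abstract could be used together, the sketch below models one paragraph record and filters by category and language. The field names, the label strings, and the sample rows are assumptions invented for this example; the actual column names in the published dataset may differ.

```python
from collections import Counter
from dataclasses import dataclass

# The four paragraph categories named in the abstract.
# The exact label strings are assumed, not taken from the dataset.
CATEGORIES = ("acknowledgment", "data", "software", "clinical_trial")


@dataclass(frozen=True)
class Paragraph:
    """Hypothetical record layout; real column names may differ."""
    text: str
    category: str  # one of CATEGORIES
    language: str  # language code predicted by fastText, e.g. "en" or "fr"
    domain: str    # scientific domain taken from OpenAlex


def select(rows, category, language):
    """Return the paragraphs that match both a category and a language code."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    return [r for r in rows if r.category == category and r.language == language]


def counts_by_language(rows):
    """Count paragraphs per language code."""
    return Counter(r.language for r in rows)


# Invented toy rows for illustration only (not drawn from the dataset).
SAMPLE = [
    Paragraph("We thank the reviewers for their comments.", "acknowledgment", "en", "Social Sciences"),
    Paragraph("Les données sont disponibles sur demande.", "data", "fr", "Health Sciences"),
    Paragraph("Analyses were run with a custom Python script.", "software", "en", "Physical Sciences"),
]
```

For example, `select(SAMPLE, "data", "fr")` returns only the French data-mention row, which is the kind of subset one might extract to build a language-specific training set.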

Taxonomy

Topics: Biomedical Text Mining and Ontologies · Topic Modeling · Text and Document Classification Technologies