SMAuC -- The Scientific Multi-Authorship Corpus

Janek Bevendorff; Philipp Sauer; Lukas Gienapp; Wolfgang Kircheis; Erik Körner; Benno Stein; Martin Potthast

arXiv:2211.02477 · cs.CL · May 11, 2023

Open Access

TL;DR

SMAuC is the largest openly accessible corpus for scientific authorship analysis, with extensive metadata that supports research across multiple disciplines and authorship scenarios.

Contribution

The paper introduces SMAuC, a comprehensive, metadata-rich corpus of over 3 million scientific publications from over 5 million authors, built to support authorship analysis research.

Findings

01. Largest open-access scientific corpus for authorship analysis

02. Includes extensive, curated author metadata

03. Supports cross-disciplinary authorship research

Abstract

The rapidly growing volume of scientific publications offers an interesting challenge for research on methods for analyzing the authorship of documents with one or more authors. However, most existing datasets lack scientific documents or the necessary metadata for constructing new experiments and test cases. We introduce SMAuC, a comprehensive, metadata-rich corpus tailored to scientific authorship analysis. Comprising over 3 million publications across various disciplines from over 5 million authors, SMAuC is the largest openly accessible corpus for this purpose. It encompasses scientific texts from humanities and natural sciences, accompanied by extensive, curated metadata, including unambiguous author IDs. SMAuC aims to significantly advance the domain of authorship analysis in scientific texts.

Taxonomy

Topics: Authorship Attribution and Profiling · Topic Modeling · Biomedical Text Mining and Ontologies

Methods: Test