SMAuC -- The Scientific Multi-Authorship Corpus
Janek Bevendorff, Philipp Sauer, Lukas Gienapp, Wolfgang Kircheis, Erik Körner, Benno Stein, Martin Potthast

TL;DR
SMAuC is the largest openly accessible corpus for scientific authorship analysis. Its extensive metadata is designed to facilitate research across multiple disciplines and authorship scenarios.
Contribution
The paper introduces SMAuC, a comprehensive, metadata-rich dataset of over 3 million scientific publications that enables research on authorship analysis.
Findings
Largest open-access scientific corpus for authorship analysis
Includes extensive, curated author metadata
Supports cross-disciplinary authorship research
Abstract
The rapidly growing volume of scientific publications offers an interesting challenge for research on methods for analyzing the authorship of documents with one or more authors. However, most existing datasets lack scientific documents or the necessary metadata for constructing new experiments and test cases. We introduce SMAuC, a comprehensive, metadata-rich corpus tailored to scientific authorship analysis. Comprising over 3 million publications across various disciplines from over 5 million authors, SMAuC is the largest openly accessible corpus for this purpose. It encompasses scientific texts from humanities and natural sciences, accompanied by extensive, curated metadata, including unambiguous author IDs. SMAuC aims to significantly advance the domain of authorship analysis in scientific texts.
Taxonomy
Topics: Authorship Attribution and Profiling · Topic Modeling · Biomedical Text Mining and Ontologies
Methods: Test
