Cleaning English Abstracts of Scientific Publications
Michael E. Rose, Nils A. Herrmann, Sebastian Erhardt

TL;DR
This paper presents an open-source language model that automatically cleans scientific abstracts by removing extraneous information, thereby improving the quality of textual analysis and similarity assessments.
Contribution
It introduces a novel, easy-to-integrate model specifically designed to clean scientific abstracts, enhancing downstream text analysis tasks.
Findings
The model removes extraneous content from abstracts conservatively and precisely.
Cleaning changes the similarity rankings of abstracts.
Standard-length embeddings of cleaned abstracts carry more information about the abstract's content.
Abstract
Scientific abstracts are often used as proxies for the content and thematic focus of research publications. However, a significant share of published abstracts contains extraneous information, such as publisher copyright statements, section headings, author notes, registrations, and bibliometric or bibliographic metadata, that can distort downstream analyses, particularly those involving document similarity or textual embeddings. We introduce an open-source, easy-to-integrate language model designed to clean English-language scientific abstracts by automatically identifying and removing such clutter. We demonstrate that our model is both conservative and precise, alters similarity rankings of cleaned abstracts and improves information content of standard-length embeddings.
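To make the distortion concrete, here is a minimal sketch of how a shared publisher notice can inflate the similarity of two unrelated abstracts. It is not the authors' model: the regular expression in `strip_copyright`, the two example abstracts, and the publisher notice are all hypothetical, and a simple bag-of-words cosine similarity stands in for the textual embeddings the abstract refers to.

```python
import math
import re
from collections import Counter

# Hypothetical rule for a trailing copyright statement. This is a crude
# stand-in for a learned cleaning model, used only to illustrate the effect.
COPYRIGHT_RE = re.compile(r"\s*(?:©|\(c\)|copyright)\s.*$", re.IGNORECASE | re.DOTALL)


def strip_copyright(text: str) -> str:
    """Remove everything from the first copyright marker to the end."""
    return COPYRIGHT_RE.sub("", text).strip()


def bag_of_words(text: str) -> Counter:
    """Count lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: str, b: str) -> float:
    """Cosine similarity between the token-count vectors of two texts."""
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


# Two invented abstracts on unrelated topics that share one invented notice.
NOTICE = " © 2020 Example Publisher. All rights reserved."
abstract_a = "We study protein folding with neural networks." + NOTICE
abstract_b = "Monetary policy shocks affect regional housing prices." + NOTICE

raw = cosine(abstract_a, abstract_b)
cleaned = cosine(strip_copyright(abstract_a), strip_copyright(abstract_b))

print(f"similarity with notice:    {raw:.2f}")
print(f"similarity without notice: {cleaned:.2f}")
```

In this toy case the two abstracts share no topical vocabulary, so all of their raw similarity (roughly 0.46) comes from the notice and it falls to zero once the notice is removed. Real clutter is far more varied than one pattern can capture, which is the motivation for a learned model rather than hand-written rules.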
Taxonomy
Topics: Academic Publishing and Open Access · Topic Modeling · Scientometrics and Bibliometrics Research
