Cleaning English Abstracts of Scientific Publications

Michael E. Rose; Nils A. Herrmann; Sebastian Erhardt

arXiv:2512.24459 · cs.CL · January 1, 2026

TL;DR

This paper presents an open-source language model that automatically cleans scientific abstracts by removing extraneous information, thereby improving the quality of textual analysis and similarity assessments.

Contribution

It introduces a novel, easy-to-integrate model specifically designed to clean scientific abstracts, enhancing downstream text analysis tasks.

Findings

01 The model removes extraneous content from abstracts conservatively and precisely.

02 Cleaning alters the similarity rankings of abstracts.

03 Standard-length embeddings of cleaned abstracts carry more information content.

Abstract

Scientific abstracts are often used as proxies for the content and thematic focus of research publications. However, a significant share of published abstracts contains extraneous information (such as publisher copyright statements, section headings, author notes, registrations, and bibliometric or bibliographic metadata) that can distort downstream analyses, particularly those involving document similarity or textual embeddings. We introduce an open-source, easy-to-integrate language model designed to clean English-language scientific abstracts by automatically identifying and removing such clutter. We demonstrate that our model is both conservative and precise, that it alters the similarity rankings of cleaned abstracts, and that it improves the information content of standard-length embeddings.
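To illustrate why such clutter can distort similarity measures, here is a minimal sketch in Python. It is not the authors' model: it assumes a simple bag-of-words cosine similarity in place of learned embeddings, and a hypothetical regular expression (`COPYRIGHT`) in place of the language model. The two example abstracts and the publisher notice are invented for illustration.

```python
import math
import re
from collections import Counter


def bow(text):
    """Return lowercased bag-of-words token counts (letters only)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical pattern for a trailing copyright statement (illustration only;
# the paper uses a language model, not a regular expression).
COPYRIGHT = re.compile(
    r"\s*(?:©|\(c\)|copyright)\s*\d{4}.*$", re.IGNORECASE | re.DOTALL
)


def strip_copyright(text):
    """Drop everything from the copyright marker to the end of the text."""
    return COPYRIGHT.sub("", text).strip()


# Two invented abstracts on unrelated topics sharing the same publisher notice.
notice = " © 2020 Example Publisher Ltd. All rights reserved."
a = "We study topic models for patent classification." + notice
b = "We measure soil moisture with satellite radar." + notice

sim_raw = cosine(bow(a), bow(b))
sim_clean = cosine(bow(strip_copyright(a)), bow(strip_copyright(b)))

print(f"similarity with notice:    {sim_raw:.3f}")
print(f"similarity without notice: {sim_clean:.3f}")
```

In this toy setting the shared notice accounts for most of the overlapping tokens, so the two unrelated abstracts look far more similar before cleaning than after. A rule-based pattern like this one is brittle, which is part of the motivation for a learned cleaning model.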

Taxonomy

Topics: Academic Publishing and Open Access · Topic Modeling · Scientometrics and bibliometrics research