How Document Pre-processing affects Keyphrase Extraction Performance

Florian Boudin; Hugo Mougard; Damien Cram

arXiv:1610.07809 · cs.CL · October 26, 2016 · 2 cites

TL;DR

This paper investigates how different document preprocessing techniques influence the effectiveness of automatic keyphrase extraction from scientific articles, emphasizing the importance of preprocessing for optimal performance.

Contribution

It provides a systematic evaluation of various preprocessing methods and their impact on keyphrase extraction accuracy, highlighting the need for careful preprocessing in scientific text analysis.

Findings

01

Preprocessing significantly affects keyphrase extraction performance.

02

Robustness of models varies with preprocessing complexity.

03

Certain preprocessing steps improve extraction accuracy.

Abstract

The SemEval-2010 benchmark dataset has brought renewed attention to the task of automatic keyphrase extraction. This dataset is made up of scientific articles that were automatically converted from PDF format to plain text and thus require careful preprocessing so that irrelevant spans of text do not negatively affect keyphrase extraction performance. In previous work, a wide range of document preprocessing techniques have been described, but their impact on the overall performance of keyphrase extraction models is still unexplored. Here, we re-assess the performance of several keyphrase extraction models and measure their robustness against increasingly sophisticated levels of document preprocessing.
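To make the evaluation idea in the abstract concrete, here is a minimal sketch of how one might compare the output of a single extraction model across preprocessing levels. It is not the authors' evaluation code: the function name `prf_at_n`, the level names, the exact-match scoring rule, and the toy keyphrase lists are all assumptions introduced for illustration, and the abstract does not state which metric the paper uses.

```python
def prf_at_n(predicted, gold, n=10):
    """Exact-match precision, recall and F1 over the top-n predicted keyphrases.

    Matching is done on lowercased, whitespace-normalized strings. This is an
    illustrative scoring rule, not necessarily the one used in the paper.
    """
    def normalize(phrase):
        return " ".join(phrase.lower().split())

    # Keep the first n distinct predictions, preserving rank order.
    top = []
    for phrase in predicted:
        key = normalize(phrase)
        if key not in top:
            top.append(key)
        if len(top) == n:
            break

    gold_set = {normalize(phrase) for phrase in gold}
    matched = sum(1 for phrase in top if phrase in gold_set)

    precision = matched / len(top) if top else 0.0
    recall = matched / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical gold keyphrases and model outputs for one document.
gold = ["keyphrase extraction", "document preprocessing", "scientific articles"]
outputs_by_level = {
    # Unfiltered PDF-to-text output: conversion noise leaks into the candidates.
    "raw text": ["keyphrase extraction", "page 3", "et al", "figure 2"],
    # Irrelevant spans removed before extraction.
    "cleaned text": [
        "keyphrase extraction",
        "document preprocessing",
        "benchmark dataset",
        "scientific articles",
    ],
}

scores = {
    level: prf_at_n(predicted, gold, n=4)
    for level, predicted in outputs_by_level.items()
}
for level, (precision, recall, f1) in scores.items():
    print(f"{level}: P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Running the same scoring function over every preprocessing level for every model gives a table from which robustness can be read off: a model whose score changes little between levels is less sensitive to preprocessing.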

Code & Models

Repositories

boudinfl/semeval-2010-pre (official)

Taxonomy

Topics: Advanced Text Analysis Techniques · Sentiment Analysis and Opinion Mining · Biomedical Text Mining and Ontologies