How Document Pre-processing affects Keyphrase Extraction Performance
Florian Boudin, Hugo Mougard, Damien Cram

TL;DR
This paper investigates how different document preprocessing techniques influence the effectiveness of automatic keyphrase extraction from scientific articles, emphasizing the importance of preprocessing for optimal performance.
Contribution
It provides a systematic evaluation of various preprocessing methods and their impact on keyphrase extraction accuracy, highlighting the need for careful preprocessing in scientific text analysis.
Findings
Preprocessing significantly affects keyphrase extraction performance.
Robustness of models varies with preprocessing complexity.
Certain preprocessing steps improve extraction accuracy.
Abstract
The SemEval-2010 benchmark dataset has brought renewed attention to the task of automatic keyphrase extraction. This dataset is made up of scientific articles that were automatically converted from PDF format to plain text and thus require careful preprocessing so that irrelevant spans of text do not negatively affect keyphrase extraction performance. In previous work, a wide range of document preprocessing techniques were described but their impact on the overall performance of keyphrase extraction models is still unexplored. Here, we re-assess the performance of several keyphrase extraction models and measure their robustness against increasingly sophisticated levels of document preprocessing.
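To make the setup in the abstract more concrete, the following is a minimal sketch of the two pieces such an experiment needs: a cleanup pass over PDF-converted text and a score comparing extracted keyphrases with gold ones. It is not the authors' pipeline. The function names `clean_pdf_text` and `f1_at_n`, the 0.5 letter-ratio threshold, and the exact-match scoring on lowercased phrases are all assumptions made for illustration.

```python
import re


def clean_pdf_text(text):
    """Toy cleanup for PDF-converted text (hypothetical rules, not the paper's)."""
    # Rejoin words hyphenated across a line break: "extrac-\ntion" -> "extraction".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    kept = []
    for line in text.split("\n"):
        stripped = line.strip()
        if not stripped:
            continue
        letters = sum(ch.isalpha() for ch in stripped)
        # Assumed heuristic: drop lines that are mostly digits or symbols,
        # such as table rows or page numbers.
        if letters / len(stripped) < 0.5:
            continue
        kept.append(stripped)
    return " ".join(kept)


def f1_at_n(predicted, gold, n=10):
    """Precision, recall and F1 over the top-n predictions.

    Matching is exact on lowercased phrases, which is a simplification.
    """
    top = [p.lower() for p in predicted[:n]]
    gold_set = {g.lower() for g in gold}
    correct = sum(1 for p in top if p in gold_set)
    precision = correct / len(top) if top else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Under these assumptions, one would run the same extraction model on the raw text and on the output of `clean_pdf_text`, then compare the two `f1_at_n` scores to see how sensitive the model is to the cleanup step.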
Taxonomy
Topics: Advanced Text Analysis Techniques · Sentiment Analysis and Opinion Mining · Biomedical Text Mining and Ontologies
