Learning Algorithms for Keyphrase Extraction
Peter D. Turney (National Research Council of Canada)

TL;DR
This paper compares two supervised learning approaches to automatic keyphrase extraction: the general-purpose C4.5 decision tree algorithm and GenEx, a custom algorithm that incorporates domain knowledge. GenEx produces better keyphrases, and about 80% of them are acceptable to human readers.
Contribution
The paper introduces GenEx, an algorithm developed specifically for keyphrase extraction, and shows that it outperforms the general-purpose C4.5 algorithm on this task.
Findings
GenEx outperforms C4.5 in keyphrase quality
Approximately 80% of generated keyphrases are human-acceptable
An algorithm designed specifically for the task extracts better keyphrases than a general-purpose one
Abstract
Many academic journals ask their authors to provide a list of about five to fifteen key words, to appear on the first page of each article. Since these key words are often phrases of two or more words, we prefer to call them keyphrases. There is a wide variety of tasks for which keyphrases are useful, as we discuss in this paper. We approach the problem of automatically extracting keyphrases from text as a supervised learning task. We treat a document as a set of phrases, which the learning algorithm must learn to classify as positive or negative examples of keyphrases. Our first set of experiments applies the C4.5 decision tree induction algorithm to this learning task. We evaluate the performance of nine different configurations of C4.5. The second set of experiments applies the GenEx algorithm to the task. We developed the GenEx algorithm specifically for automatically extracting…
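The supervised framing in the abstract (a document as a set of phrases, each labeled as a positive or negative example of a keyphrase) can be illustrated with a minimal sketch. This is not the paper's C4.5 or GenEx pipeline. The function names, the three-word limit on phrase length, and the three features (`frequency`, `first_position`, `num_words`) are hypothetical choices made only for this example.

```python
import re
from collections import Counter


def candidate_phrases(text, max_len=3):
    """Return ((phrase, start_word_index) pairs, word_count).

    Assumption: a candidate is any run of 1..max_len consecutive words.
    Sentence boundaries are ignored to keep the sketch short.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    candidates = []
    for i in range(len(words)):
        for n in range(1, max_len + 1):
            if i + n <= len(words):
                candidates.append((" ".join(words[i:i + n]), i))
    return candidates, len(words)


def labeled_examples(text, author_keyphrases, max_len=3):
    """Build one (phrase, features, label) example per distinct candidate.

    A phrase is a positive example when it matches one of the
    author-supplied keyphrases, and a negative example otherwise.
    """
    candidates, n_words = candidate_phrases(text, max_len)
    positives = {k.lower() for k in author_keyphrases}
    counts = Counter(phrase for phrase, _ in candidates)

    # Record the word index at which each phrase first appears.
    first = {}
    for phrase, start in candidates:
        first.setdefault(phrase, start)

    examples = []
    for phrase in counts:
        features = {
            "frequency": counts[phrase],                 # hypothetical feature
            "first_position": first[phrase] / n_words,   # hypothetical feature
            "num_words": len(phrase.split()),            # hypothetical feature
        }
        examples.append((phrase, features, phrase in positives))
    return examples


if __name__ == "__main__":
    doc = ("Keyphrase extraction is a learning task. "
           "Keyphrase extraction uses phrases.")
    for phrase, features, label in labeled_examples(doc, ["keyphrase extraction"]):
        if label:
            print(phrase, features)
```

The labeled examples produced this way could be passed to any classifier. Most candidates are negative, so the classes are heavily imbalanced, which a learner for this task has to cope with.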
Taxonomy
Topics: Advanced Text Analysis Techniques · Biomedical Text Mining and Ontologies · Information Retrieval and Search Behavior
