Learning to Extract Keyphrases from Text
Peter D. Turney (National Research Council of Canada)

TL;DR
This paper treats automatic keyphrase extraction from text as a supervised learning task and introduces the GenEx algorithm, which outperforms both the general-purpose C4.5 algorithm and existing commercial tools.
Contribution
The paper introduces GenEx, an algorithm designed specifically for keyphrase extraction, and shows that it outperforms both a general-purpose learning algorithm and commercial software.
Findings
GenEx outperforms C4.5 decision trees in keyphrase extraction.
GenEx surpasses the keyphrase extraction built into Microsoft's Word 97 and Verity's Search 97.
A learning algorithm specialized for keyphrase extraction produces better keyphrases than a general-purpose one.
Abstract
Many academic journals ask their authors to provide a list of about five to fifteen key words, to appear on the first page of each article. Since these key words are often phrases of two or more words, we prefer to call them keyphrases. There is a surprisingly wide variety of tasks for which keyphrases are useful, as we discuss in this paper. Recent commercial software, such as Microsoft's Word 97 and Verity's Search 97, includes algorithms that automatically extract keyphrases from documents. In this paper, we approach the problem of automatically extracting keyphrases from text as a supervised learning task. We treat a document as a set of phrases, which the learning algorithm must learn to classify as positive or negative examples of keyphrases. Our first set of experiments applies the C4.5 decision tree induction algorithm to this learning task. The second set of experiments applies…
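The abstract's framing, a document as a set of phrases to be classified as positive or negative examples of keyphrases, can be illustrated with a minimal sketch. This is not GenEx or C4.5, and it is not the paper's feature set. The function names `candidate_phrases` and `label_phrases`, the three-word phrase limit, and the sample text are all assumptions made for illustration. The sketch only shows how labeled training examples might be built from a document and its author-supplied keyphrases.

```python
import re

def candidate_phrases(text, max_len=3):
    """Return the set of lowercase word n-grams (1..max_len words) in text.

    Hypothetical helper: the 3-word limit is an assumption for this sketch.
    """
    phrases = set()
    # Split on punctuation so that no phrase spans a sentence or clause boundary.
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        words = fragment.split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                phrases.add(" ".join(words[i:i + n]))
    return phrases

def label_phrases(text, author_keyphrases, max_len=3):
    """Map each candidate phrase to True (keyphrase) or False (not a keyphrase)."""
    positives = {k.lower() for k in author_keyphrases}
    return {p: (p in positives) for p in candidate_phrases(text, max_len)}

# Made-up example document and author keyphrases.
doc = ("Keyphrase extraction is a supervised learning task. "
       "Decision trees classify phrases.")
labels = label_phrases(doc, ["keyphrase extraction", "supervised learning"])
```

Each entry of `labels` is one training example. A learner such as a decision tree would then be trained on features computed for each phrase. Those features are left out here because the truncated abstract does not list them.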
Taxonomy
Topics: Advanced Text Analysis Techniques · Text and Document Classification Technologies · Information Retrieval and Search Behavior
