Extraction of Keyphrases from Text: Evaluation of Four Algorithms
Peter D. Turney (National Research Council of Canada)

TL;DR
This paper empirically evaluates four algorithms for automatic keyphrase extraction across multiple document collections, finding NRC's Extractor performs best in matching human-generated keyphrases.
Contribution
It compares four keyphrase extraction algorithms on five document collections, scoring each against manually generated keyphrases, and identifies NRC's Extractor as the best match.
Findings
NRC's Extractor outperforms other algorithms in matching manual keyphrases.
The result holds on all five document collections used in the evaluation.
The target keyphrases were written for human readers and were not tailored to any of the four algorithms.
Abstract
This report presents an empirical evaluation of four algorithms for automatically extracting keywords and keyphrases from documents. The four algorithms are compared using five different collections of documents. For each document, we have a target set of keyphrases, which were generated by hand. The target keyphrases were generated for human readers; they were not tailored for any of the four keyphrase extraction algorithms. Each of the algorithms was evaluated by the degree to which the algorithm's keyphrases matched the manually generated keyphrases. The four algorithms were (1) the AutoSummarize feature in Microsoft's Word 97, (2) an algorithm based on Eric Brill's part-of-speech tagger, (3) the Summarize feature in Verity's Search 97, and (4) NRC's Extractor algorithm. For all five document collections, NRC's Extractor yields the best match with the manually generated keyphrases.
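The abstract describes scoring each algorithm by how well its keyphrases match the manually generated ones. A minimal sketch of one way to measure such a match follows. The matching rule (case-insensitive exact match after whitespace normalization), the use of precision and recall, and the names `normalize` and `match_scores` are assumptions made for illustration; the abstract does not state the report's actual matching criterion or measure.

```python
from typing import Iterable, Tuple


def normalize(phrase: str) -> str:
    """Lowercase a phrase and collapse runs of whitespace (assumed matching rule)."""
    return " ".join(phrase.lower().split())


def match_scores(extracted: Iterable[str], target: Iterable[str]) -> Tuple[float, float]:
    """Return (precision, recall) of extracted keyphrases against target keyphrases.

    Hypothetical helper: a phrase counts as matched only if its normalized
    form appears in both sets.
    """
    extracted_set = {normalize(p) for p in extracted}
    target_set = {normalize(p) for p in target}
    matched = extracted_set & target_set
    # Guard against empty sets to avoid division by zero.
    precision = len(matched) / len(extracted_set) if extracted_set else 0.0
    recall = len(matched) / len(target_set) if target_set else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up phrases for illustration, not data from the report.
    machine_phrases = ["Keyphrase Extraction", "machine learning", "documents"]
    human_phrases = ["keyphrase  extraction", "evaluation"]
    print(match_scores(machine_phrases, human_phrases))
```

Averaging such scores over every document in a collection would give one number per algorithm per collection, which is the kind of comparison the abstract summarizes. A looser rule, such as matching after stemming, would change the scores, so the rule should be fixed before comparing algorithms.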
Taxonomy
Topics: Advanced Text Analysis Techniques · Information Retrieval and Search Behavior · Mathematics, Computing, and Information Processing
