Statistical keyword detection in literary corpora

Juan P. Herrera; Pedro A. Pury

arXiv:cs/0701028·cs.CL·June 7, 2008

Open Access

TL;DR

This paper introduces a statistical method using Shannon's entropy to automatically detect and rank keywords in literary texts, demonstrated on Darwin's The Origin of Species, and compares its effectiveness with existing methods.

Contribution

It presents a novel keyword detection technique based on spatial word distribution and entropy, with calibration using shuffled texts, advancing automatic keyword extraction methods.

Findings

01

Effective keyword detection demonstrated on Darwin's text

02

Comparison shows improved performance over existing methods

03

Using shuffled texts for calibration enhances accuracy

Abstract

Understanding the complexity of human language requires an appropriate analysis of the statistical distribution of words in texts. We consider the information retrieval problem of detecting and ranking the relevant words of a text by means of statistical information referring to the "spatial" use of the words. Shannon's entropy of information is used as a tool for automatic keyword extraction. Using The Origin of Species by Charles Darwin as a representative text sample, we show the performance of our detector and compare it with other proposals in the literature. The randomly shuffled text receives special attention as a tool for calibrating the ranking indices.
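The abstract does not state the formulas, so the following is only a minimal sketch of the general idea, under assumptions of our own: the text is split into equal-length parts, the entropy of a word's occurrences across those parts is normalized to [0, 1], and the same quantity averaged over randomly shuffled copies of the text serves as a baseline. The function names (`spatial_entropy`, `shuffled_baseline`, `clustering_score`) and the scoring rule are hypothetical illustrations, not the authors' ranking index.

```python
import math
import random


def spatial_entropy(tokens, word, parts):
    """Normalized Shannon entropy of `word` over `parts` equal-length parts.

    0.0 means every occurrence falls in one part (strongly clustered);
    1.0 means the occurrences are spread evenly over all parts.
    """
    if parts < 2:
        raise ValueError("parts must be at least 2")
    n = len(tokens)
    counts = [0] * parts
    for i, tok in enumerate(tokens):
        if tok == word:
            # Map token position i to a part index in 0..parts-1.
            counts[i * parts // n] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError(f"{word!r} does not occur in the text")
    h = -sum((c / total) * math.log(c / total) for c in counts if c)
    return h / math.log(parts)


def shuffled_baseline(tokens, word, parts, trials=20, seed=0):
    """Mean spatial entropy of `word` over randomly shuffled copies of the text."""
    rng = random.Random(seed)
    shuffled = list(tokens)
    values = []
    for _ in range(trials):
        rng.shuffle(shuffled)
        values.append(spatial_entropy(shuffled, word, parts))
    return sum(values) / trials


def clustering_score(tokens, word, parts, trials=20, seed=0):
    """Assumed score: shuffled-text entropy minus observed entropy.

    Positive values suggest the word is more clustered than chance.
    """
    baseline = shuffled_baseline(tokens, word, parts, trials, seed)
    return baseline - spatial_entropy(tokens, word, parts)


if __name__ == "__main__":
    # Toy text: "key" is concentrated at the start, "the" is evenly spread.
    toy = ["key", "the", "key", "x"] * 2 + ["the", "x", "x", "x"] * 6
    for w in ("key", "the"):
        print(w, round(clustering_score(toy, w, parts=4), 3))
```

In this sketch, words whose score is clearly positive are candidate keywords, since shuffling destroys any clustering while preserving word frequencies. The number of parts and the number of shuffling trials are free parameters chosen here purely for illustration.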

Taxonomy

Topics: Advanced Text Analysis Techniques