Statistical keyword detection in literary corpora
Juan P. Herrera, Pedro A. Pury

TL;DR
This paper introduces a statistical method using Shannon's entropy to automatically detect and rank keywords in literary texts, demonstrated on Darwin's The Origin of Species, and compares its effectiveness with existing methods.
Contribution
It presents a novel keyword detection technique based on spatial word distribution and entropy, with calibration using shuffled texts, advancing automatic keyword extraction methods.
Findings
Effective keyword detection demonstrated on Darwin's text
Comparison shows improved performance over existing methods
Using shuffled texts for calibration enhances accuracy
Abstract
Understanding the complexity of human language requires an appropriate analysis of the statistical distribution of words in texts. We consider the information retrieval problem of detecting and ranking the relevant words of a text by means of statistical information referring to the "spatial" use of the words. Shannon's entropy of information is used as a tool for automatic keyword extraction. By using The Origin of Species by Charles Darwin as a representative text sample, we show the performance of our detector and compare it with other proposals in the literature. The randomly shuffled text receives special attention as a tool for calibrating the ranking indices.
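The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' exact estimator: the function names (`spatial_entropies`, `keyword_scores`), the equal-size partition into `parts` chunks, the `min_count` cutoff, and the score defined as "mean shuffled entropy minus observed entropy" are all assumptions made for illustration. The sketch assumes that a word clustered in a few regions of the text has low entropy across chunks, and that shuffling the text gives the entropy expected for a word of the same frequency used without any spatial preference.

```python
import math
import random
from collections import Counter


def spatial_entropies(tokens, vocab, parts):
    """Normalized Shannon entropy (0..1) of each word's occurrences
    across `parts` equal-size chunks of the token sequence."""
    n = len(tokens)
    counts = {w: [0] * parts for w in vocab}
    for i, tok in enumerate(tokens):
        if tok in counts:
            # i * parts // n maps position i to a chunk index in 0..parts-1
            counts[tok][i * parts // n] += 1
    entropies = {}
    for w, per_chunk in counts.items():
        total = sum(per_chunk)
        if total == 0:
            continue  # word never occurs; entropy undefined
        h = -sum((c / total) * math.log(c / total) for c in per_chunk if c)
        entropies[w] = h / math.log(parts)
    return entropies


def keyword_scores(tokens, parts=8, min_count=4, shuffles=20, seed=0):
    """Hypothetical ranking index: how far a word's entropy falls below
    the average entropy it has in randomly shuffled copies of the text."""
    rng = random.Random(seed)
    freq = Counter(tokens)
    vocab = {w for w, c in freq.items() if c >= min_count}
    observed = spatial_entropies(tokens, vocab, parts)

    # Calibration: shuffling keeps word frequencies but destroys clustering.
    baseline = dict.fromkeys(vocab, 0.0)
    shuffled = list(tokens)
    for _ in range(shuffles):
        rng.shuffle(shuffled)
        for w, h in spatial_entropies(shuffled, vocab, parts).items():
            baseline[w] += h / shuffles

    return {w: baseline[w] - observed[w] for w in vocab}
```

Calling `keyword_scores(text.lower().split())` on a tokenized text and sorting the result by descending score would put spatially clustered words first, while evenly spread function words should score near zero. The choice of `parts` matters: too few chunks hides clustering, and too many makes low-frequency words look clustered by chance, which is the effect the shuffled baseline is meant to offset.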
Taxonomy
Topics: Advanced Text Analysis Techniques
