Principal Phrase Mining
Ellie Small, Javier Cabrera

TL;DR
This paper introduces a method for extracting meaningful phrases from texts without double-counting, and without requiring human input or a list of quality phrases, addressing a key obstacle in phrase mining.
Contribution
The proposed approach automatically identifies principal phrases by eliminating double-counting, enhancing phrase extraction accuracy without human intervention or pre-existing quality phrase lists.
Findings
Effectively eliminates double-counting in phrase extraction
Identifies meaningful principal phrases independently
Operates efficiently on various text collections
Abstract
Extracting frequent words from a collection of texts is commonly performed in many subjects. However, as useful as it is to obtain a collection of commonly occurring words from texts, there is a need for more specific information to be obtained from texts in the form of most commonly occurring phrases. Despite this need, extracting frequent phrases is not commonly done due to inherent complications, the most significant being double-counting. Double-counting occurs when words or phrases are counted when they appear inside longer phrases that themselves are also counted, resulting in a selection of mostly meaningless phrases that are frequent only because they occur inside frequent super phrases. Several papers have been written on phrase mining that describe solutions to this issue; however, they either require a list of so-called quality phrases to be available to the extracting…
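To make the double-counting problem concrete, here is a minimal Python sketch. It is not the authors' algorithm. The toy corpus and the helper names (`ngram_counts`, `occurrences`, `standalone_count`) are hypothetical and chosen only for illustration. The sketch counts every contiguous word n-gram naively, then checks how many occurrences of a sub-phrase fall outside a given longer phrase.

```python
from collections import Counter

def ngram_counts(texts, max_n=3):
    """Naively count every contiguous word n-gram (1 <= n <= max_n)."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts

def occurrences(words, phrase_words):
    """Start indices at which phrase_words occurs in words."""
    n = len(phrase_words)
    return [i for i in range(len(words) - n + 1) if words[i:i + n] == phrase_words]

def standalone_count(texts, phrase, superphrase):
    """Count occurrences of phrase that are NOT contained in superphrase."""
    p, s = phrase.split(), superphrase.split()
    total = 0
    for text in texts:
        words = text.lower().split()
        # Word spans [start, end) covered by the longer phrase.
        covered = [(j, j + len(s)) for j in occurrences(words, s)]
        for i in occurrences(words, p):
            if not any(a <= i and i + len(p) <= b for a, b in covered):
                total += 1
    return total

# Hypothetical toy corpus, for illustration only.
texts = [
    "new york city is large",
    "she moved to new york city",
    "new york city never sleeps",
]

counts = ngram_counts(texts)
naive_york_city = counts["york city"]
alone_york_city = standalone_count(texts, "york city", "new york city")
print(naive_york_city, alone_york_city)
```

In this toy corpus the naive count ranks "york city" as highly as "new york city", although every occurrence of "york city" sits inside the longer phrase. That is the kind of frequent but meaningless phrase the abstract describes.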
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Advanced Text Analysis Techniques
