Principal Phrase Mining

Ellie Small; Javier Cabrera

arXiv:2206.13748 · cs.CL · November 1, 2022 · 1 citation

Open Access

TL;DR

This paper introduces a method for extracting frequent, meaningful phrases from texts without double-counting, addressing a key challenge in phrase mining. It requires neither human input nor a list of quality phrases.

Contribution

The proposed approach automatically identifies principal phrases while eliminating double-counting, and it needs neither human intervention nor a pre-existing list of quality phrases.

Findings

01 Effectively eliminates double-counting in phrase extraction

02 Identifies meaningful principal phrases independently

03 Operates efficiently on various text collections

Abstract

Extracting frequent words from a collection of texts is commonly performed in many subjects. However, as useful as it is to obtain a collection of commonly occurring words from texts, there is a need for more specific information to be obtained from texts in the form of most commonly occurring phrases. Despite this need, extracting frequent phrases is not commonly done due to inherent complications, the most significant being double-counting. Double-counting occurs when words or phrases are counted when they appear inside longer phrases that themselves are also counted, resulting in a selection of mostly meaningless phrases that are frequent only because they occur inside frequent super phrases. Several papers have been written on phrase mining that describe solutions to this issue; however, they either require a list of so-called quality phrases to be available to the extracting…
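The double-counting problem described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm; it is a naive n-gram counter written only to show the symptom. The function names `ngram_counts` and `occurrences_outside`, the toy corpus, and the `max_len` parameter are all assumptions introduced for this illustration.

```python
from collections import Counter


def ngram_counts(texts, max_len=3):
    """Naively count every contiguous word sequence of length 1..max_len."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts


def occurrences_outside(sub, sup, counts):
    """Occurrences of `sub` that are not accounted for by the super phrase `sup`.

    Assumes occurrences of `sup` do not overlap each other in the text,
    which holds for the toy corpus below.
    """
    n = len(sub)
    # How many times `sub` appears inside a single occurrence of `sup`.
    per_super = sum(1 for i in range(len(sup) - n + 1) if sup[i:i + n] == sub)
    return counts[sub] - per_super * counts[sup]


# Hypothetical toy corpus, not taken from the paper.
texts = [
    "support vector machine works well",
    "a support vector machine is a classifier",
    "train a support vector machine",
]

counts = ngram_counts(texts)
sup = ("support", "vector", "machine")
for sub in [("support", "vector"), ("vector", "machine")]:
    print(sub, counts[sub], occurrences_outside(sub, sup, counts))
```

In this toy corpus, "support vector" and "vector machine" each receive a naive count of 3, yet neither occurs anywhere outside "support vector machine". They look frequent only because the super phrase is frequent, which is the effect the abstract calls double-counting.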

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Advanced Text Analysis Techniques