Accelerating Text Mining Using Domain-Specific Stop Word Lists
Farah Alshanik, Amy Apon, Alexander Herzog, Ilya Safro, Justin Sybrandt

TL;DR
This paper introduces a hyperplane-based method for automatically extracting domain-specific stop words, significantly reducing text dimensionality and improving classification performance in text mining tasks.
Contribution
A novel mathematical hyperplane-based approach for automatic domain-specific stop word extraction that outperforms existing methods in efficiency and effectiveness.
Findings
Reduces text dimensionality by up to 90%.
Outperforms mutual information in feature selection.
Significantly lowers computational time for stop word identification.
Abstract
Text preprocessing is an essential step in text mining. Removing words that can negatively impact the quality of prediction algorithms or are not informative enough is a crucial storage-saving technique in text indexing and results in improved computational efficiency. Typically, a generic stop word list is applied to a dataset regardless of the domain. However, many common words are different from one domain to another but have no significance within a particular domain. Eliminating domain-specific common words in a corpus reduces the dimensionality of the feature space, and improves the performance of text mining tasks. In this paper, we present a novel mathematical approach for the automatic extraction of domain-specific words called the hyperplane-based approach. This new approach depends on the notion of low dimensional representation of the word in vector space and its distance…
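To make the dimensionality claim concrete, here is a minimal sketch of how removing domain-specific common words shrinks a vocabulary. It does not implement the paper's hyperplane-based approach, whose details are not given in the text above. It swaps in a plain document-frequency heuristic instead. The function names, the `min_doc_ratio` threshold, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter


def domain_stop_words(docs, min_doc_ratio=0.8):
    """Return words that occur in at least min_doc_ratio of the documents.

    Illustrative document-frequency heuristic only; this is NOT the
    hyperplane-based method described in the paper.
    """
    doc_freq = Counter()
    for doc in docs:
        # Count each word at most once per document.
        doc_freq.update(set(doc.lower().split()))
    threshold = min_doc_ratio * len(docs)
    return {word for word, df in doc_freq.items() if df >= threshold}


def vocabulary(docs, stop_words=frozenset()):
    """Return the set of distinct words left after removing stop_words."""
    return {
        word
        for doc in docs
        for word in doc.lower().split()
        if word not in stop_words
    }


# Hypothetical toy corpus from a single (medical) domain.
docs = [
    "patient received treatment for diabetes",
    "patient received treatment for asthma",
    "patient received surgery for fracture",
    "patient declined treatment for migraine",
]

stops = domain_stop_words(docs, min_doc_ratio=0.75)
before = vocabulary(docs)
after = vocabulary(docs, stops)
print(sorted(stops))
print(len(before), "->", len(after))
```

In this toy corpus, words such as "patient" and "treatment" would not appear on a generic stop word list, yet they occur in nearly every document and so carry little information for telling documents in this domain apart.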
Taxonomy
Methods: Feature Selection
