Customized determination of stop words using Random Matrix Theory approach

Bogdan Łobodziński

arXiv:2104.08642·cs.CL·October 27, 2021·1 cites

TL;DR

This paper introduces a novel method using Random Matrix Theory and the Brody distribution to identify and customize stop words in texts across any language by analyzing word distance distributions.

Contribution

It proposes a new, agnostic approach to determine stop words based on statistical distribution fitting, enhancing text preprocessing techniques.

Findings

01

The single-parameter Brody distribution describes the distribution of distances between occurrences of the same word.

02

Uninformative words can be identified by applying an adjustable goodness-of-fit threshold.

03

The method applies to texts in any language and yields customized stop word lists.

Abstract

Distances between words, measured in word units, are studied and compared with the distributions of Random Matrix Theory (RMT). The distribution of distances between occurrences of the same word is well described by the single-parameter Brody distribution. Using the Brody distribution fit, we find that the distances between given words in a set of texts can show mixed dynamics, with coexisting regular and chaotic regimes. Words whose distance distributions are fitted by the Brody distribution to within a chosen goodness-of-fit threshold can be identified as stop words, usually considered the uninformative part of the text. By applying various threshold values for the goodness of fit, we can extract uninformative words from the texts under analysis to the desired extent. On this basis we formulate a fully agnostic recipe that can be used in the creation of a customized set…
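The recipe described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the Kolmogorov-Smirnov distance stands in for the goodness-of-fit measure (the abstract does not name one), the Brody parameter q is found by a simple grid search over [0, 1], and the function names, the `max_ks` threshold, and the `min_gaps` cutoff are all hypothetical. The Brody cumulative distribution used here is the standard form F(s) = 1 - exp(-b s^(q+1)) with b = Γ((q+2)/(q+1))^(q+1), where s is the distance rescaled to unit mean; q = 0 gives the Poisson case and q = 1 the Wigner case.

```python
import math
from collections import defaultdict


def word_distances(tokens):
    """Map each word to the gaps (in word units) between its consecutive occurrences."""
    last_seen = {}
    gaps = defaultdict(list)
    for i, word in enumerate(tokens):
        if word in last_seen:
            gaps[word].append(i - last_seen[word])
        last_seen[word] = i
    return gaps


def brody_cdf(s, q):
    """Brody cumulative distribution for a unit-mean spacing s and parameter q."""
    b = math.gamma((q + 2.0) / (q + 1.0)) ** (q + 1.0)
    return 1.0 - math.exp(-b * s ** (q + 1.0))


def ks_distance(spacings, q):
    """Kolmogorov-Smirnov distance between the empirical CDF and the Brody CDF.

    Assumption: KS distance is used here as the goodness-of-fit measure.
    """
    xs = sorted(spacings)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = brody_cdf(x, q)
        d = max(d, abs(f - i / n), abs((i + 1) / n - f))
    return d


def fit_brody(gaps, grid=101):
    """Grid-search q in [0, 1]; return (best_q, ks_distance_at_best_q)."""
    mean = sum(gaps) / len(gaps)
    spacings = [g / mean for g in gaps]  # rescale to unit mean spacing
    qs = [k / (grid - 1) for k in range(grid)]
    best_d, best_q = min((ks_distance(spacings, q), q) for q in qs)
    return best_q, best_d


def candidate_stop_words(tokens, max_ks=0.1, min_gaps=30):
    """Return {word: (q, ks)} for words whose gap distribution fits Brody well.

    max_ks and min_gaps are illustrative defaults, not values from the paper.
    """
    result = {}
    for word, gaps in word_distances(tokens).items():
        if len(gaps) >= min_gaps:
            q, d = fit_brody(gaps)
            if d <= max_ks:
                result[word] = (q, d)
    return result
```

Raising `max_ks` admits more words into the candidate list and lowering it admits fewer, which mirrors the adjustable threshold the abstract describes. Because word gaps are integers, the KS distance against a continuous distribution is only a rough measure in this sketch.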

Taxonomy

Topics: Opinion Dynamics and Social Influence · Cellular Automata and Applications · Algorithms and Data Compression