Effective Listings of Function Stop words for Twitter
Murphy Choy

TL;DR
This paper proposes a new technique using combinatorial values to develop effective stop words lists specifically for Twitter data, where highly repetitive text and the inconsistency of stop words lists across textual sources make such lists difficult to build.
Contribution
It introduces a novel method based on combinatorial values for identifying stop words tailored to Twitter's unique textual characteristics.
Findings
New stop words list for Twitter created
Improved removal of non-informative words in Twitter text
Enhanced text mining accuracy on Twitter data
Abstract
Many words in documents recur very frequently but are essentially meaningless, as they are used to join words together in a sentence. It is commonly understood that stop words do not contribute to the context or content of textual documents. Due to their high frequency of occurrence, their presence in text mining presents an obstacle to understanding the content of the documents. To eliminate the bias effects, most text mining software or approaches make use of a stop words list to identify and remove those words. However, the development of such stop words lists is difficult and inconsistent between textual sources. This problem is further aggravated by sources such as Twitter, which are highly repetitive or similar in nature. In this paper, we will be examining the original work using term frequency, inverse document frequency and term adjacency for developing a stop words list for…
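To make the frequency-based measures named in the abstract concrete, here is a minimal sketch of ranking stop word candidates by term frequency and inverse document frequency. It is not the paper's combinatorial method, and it does not cover term adjacency. The function name `stop_word_candidates`, the whitespace tokenizer, the ranking rule, and the sample tweets are all assumptions made for illustration.

```python
import math
from collections import Counter


def stop_word_candidates(docs, top_k=5):
    """Rank tokens that are frequent overall and spread across many documents.

    Hypothetical helper: a low IDF (token appears in most documents) combined
    with a high term frequency marks a token as a stop word candidate.
    """
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Term frequency: total occurrences across the whole corpus.
    tf = Counter(tok for doc in tokenized for tok in doc)
    # Document frequency: number of documents containing the token.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # Inverse document frequency: log(N / df).
    idf = {tok: math.log(n_docs / df[tok]) for tok in df}

    # Lowest IDF first, then highest TF, then alphabetical for stable ties.
    ranked = sorted(tf, key=lambda tok: (idf[tok], -tf[tok], tok))
    return [(tok, tf[tok], idf[tok]) for tok in ranked[:top_k]]


# Made-up sample tweets, for illustration only.
tweets = [
    "the game is on tonight",
    "the traffic is terrible",
    "is the new phone out",
    "coffee is the answer",
]
candidates = stop_word_candidates(tweets, top_k=2)
```

In this toy corpus, "is" and "the" appear in every tweet, so their IDF is zero and they rank first. On real Twitter data the tokenizer would also need to handle hashtags, mentions, and URLs, which this sketch ignores.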
