Effective Listings of Function Stop words for Twitter

Murphy Choy

arXiv:1205.6396·cs.IR·May 30, 2012

TL;DR

This paper proposes a new technique using combinatorial values to develop effective stop words lists specifically for Twitter data, where text is highly repetitive and stop words lists built from other textual sources are inconsistent.

Contribution

It introduces a novel method based on combinatorial values for identifying stop words tailored to Twitter's unique textual characteristics.

Findings

1. New stop words list for Twitter created
2. Improved removal of non-informative words in Twitter text
3. Enhanced text mining accuracy on Twitter data

Abstract

Many words in documents recur very frequently but are essentially meaningless as they are used to join words together in a sentence. It is commonly understood that stop words do not contribute to the context or content of textual documents. Due to their high frequency of occurrence, their presence in text mining presents an obstacle to the understanding of the content in the documents. To eliminate the bias effects, most text mining software or approaches make use of a stop words list to identify and remove those words. However, the development of such a stop words list is difficult and inconsistent between textual sources. This problem is further aggravated by sources such as Twitter which are highly repetitive or similar in nature. In this paper, we will be examining the original work using term frequency, inverse document frequency and term adjacency for developing a stop words list for…
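
The abstract names term frequency and inverse document frequency as signals used in earlier work on stop words lists. The sketch below illustrates how those two signals might be combined to rank candidate stop words. It is not the paper's combinatorial method, and it leaves term adjacency out. The function name `rank_stop_word_candidates`, the whitespace tokenization, and the scoring rule (normalized term frequency divided by 1 + IDF) are all assumptions made for illustration.

```python
import math
from collections import Counter


def rank_stop_word_candidates(docs, top_k=5):
    """Rank terms that are both frequent and spread across many documents.

    Hypothetical helper: the scoring rule is an illustrative assumption,
    not a formula taken from the paper.
    """
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Term frequency: total occurrences across the whole collection.
    tf = Counter(tok for doc in tokenized for tok in doc)
    # Document frequency: number of documents containing the term.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    total_tokens = sum(tf.values())

    scores = {}
    for term, count in tf.items():
        # IDF is 0 for a term present in every document.
        idf = math.log(n_docs / df[term])
        # High normalized TF and low IDF both push the score up.
        scores[term] = (count / total_tokens) / (1.0 + idf)

    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]
```

On real tweets, tokens such as hashtags, mentions, and URLs would likely need separate handling before a ranking like this is meaningful.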
