Impact of Feature Selection on Micro-Text Classification
Ankit Vadehra, Maura R. Grossman, Gordon V. Cormack

TL;DR
This study investigates how different feature selection methods at the word and character levels affect the performance of micro-text classification on Twitter data, revealing that character-level features outperform word-level features.
Contribution
It provides a comparative analysis of word versus character-level feature extraction methods on Twitter micro-text classification, highlighting the superior performance of character-level features.
Findings
Character-level features outperform word-level features.
Pre-processing methods such as Porter stemming and POS-lemmatization do not match the performance of simple character-level groups.
Simple character-level groups yield better results than simple word groups in multi-class classification.
Abstract
Social media datasets, especially tweets from Twitter, are popular in the field of text classification. Tweets are a valuable source of micro-text (sometimes referred to as "micro-blogs") and have been studied in domains such as sentiment analysis, recommendation systems, spam detection, and clustering, among others. Tweets often include keywords referred to as "hashtags" that can be used as labels for the tweet. Using tweets encompassing 50 labels, we studied the impact of word-level versus character-level feature selection and extraction on different learners to solve a multi-class classification task. We show that feature extraction of simple character-level groups performs better than both simple word groups and pre-processing methods such as normalization with Porter's Stemming and Part-of-Speech ("POS")-Lemmatization.
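The contrast between word-level and character-level groups can be made concrete with a minimal sketch. It is not the authors' pipeline: the function names (`split_hashtag_labels`, `word_ngrams`, `char_ngrams`), the whitespace tokenizer, the lowercasing, and the group sizes are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import re
from collections import Counter


def split_hashtag_labels(tweet):
    """Separate a tweet into its text and its hashtags (hypothetical helper).

    Hashtags are removed from the text so that the label does not leak
    into the features; whether the paper does this is an assumption.
    """
    labels = [tag.lower() for tag in re.findall(r"#(\w+)", tweet)]
    text = " ".join(re.sub(r"#\w+", "", tweet).split())
    return text, labels


def word_ngrams(text, n=1):
    """Count groups of n consecutive whitespace-separated words."""
    tokens = text.lower().split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def char_ngrams(text, n=4):
    """Count groups of n consecutive characters, spaces included."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


# Example: two inflected forms share no word feature, but they do share
# character groups, without any stemming or lemmatization step.
shared_words = set(word_ngrams("running")) & set(word_ngrams("runner"))
shared_chars = set(char_ngrams("running", n=3)) & set(char_ngrams("runner", n=3))
```

In this sketch `shared_words` is empty while `shared_chars` contains the groups common to both spellings, which illustrates how character-level groups can capture overlap between word variants that word-level groups only recover after normalization. The resulting counters could be fed to any learner that accepts sparse count features.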
Taxonomy
Topics: Text and Document Classification Technologies · Topic Modeling · Spam and Phishing Detection
