Impact of Feature Selection on Micro-Text Classification

Ankit Vadehra; Maura R. Grossman; Gordon V. Cormack

arXiv:1708.08123 · cs.IR · August 29, 2017 · 2 cites

Open Access

TL;DR

This study investigates how different feature selection methods at the word and character levels affect the performance of micro-text classification on Twitter data, revealing that character-level features outperform word-level features.

Contribution

It provides a comparative analysis of word versus character-level feature extraction methods on Twitter micro-text classification, highlighting the superior performance of character-level features.

Findings

01

Character-level features outperform word-level features.

02

Pre-processing methods like stemming and lemmatization do not improve performance.

03

Simple character-level groups yield better results in multi-class classification.
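
To make the word-level versus character-level distinction concrete, here is a minimal sketch of the two kinds of feature extraction. It is an illustration only: the function names, the lowercasing, the whitespace tokenization, and the n-gram sizes are assumptions, since this summary does not state the exact settings used in the paper.

```python
from collections import Counter


def word_ngrams(text, n=1):
    """Count word-level n-grams (assumed: lowercase, whitespace tokens)."""
    tokens = text.lower().split()
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


def char_ngrams(text, n=3):
    """Count character-level n-grams over the lowercased raw string."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


if __name__ == "__main__":
    tweet = "Loving the new phone"  # hypothetical example text
    print(word_ngrams(tweet, n=1))
    print(char_ngrams(tweet, n=3))
```

Character n-grams span word boundaries and tolerate misspellings and abbreviations, which is one plausible reason they suit short, noisy text; the summary itself reports only the outcome, not the cause.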

Abstract

Social media datasets, especially Twitter tweets, are popular in the field of text classification. Tweets are a valuable source of micro-text (sometimes referred to as "micro-blogs"), and have been studied in domains such as sentiment analysis, recommendation systems, spam detection, and clustering, among others. Tweets often include keywords referred to as "hashtags" that can be used as labels for the tweet. Using tweets encompassing 50 labels, we studied the impact of word-level versus character-level feature selection and extraction on different learners to solve a multi-class classification task. We show that feature extraction of simple character-level groups performs better than simple word groups, and better than pre-processing methods such as normalization with Porter's stemming and part-of-speech ("POS") lemmatization.
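
The idea of using a hashtag as the class label can be sketched as follows. This is a hedged illustration, not the authors' pipeline: the helper `split_label`, the choice to take the first matching hashtag, and the removal of that hashtag from the text (so a classifier cannot simply read off the answer) are all assumptions.

```python
import re

# Matches a '#' followed by word characters, capturing the tag name.
HASHTAG = re.compile(r"#(\w+)")


def split_label(tweet, known_labels):
    """Return (text_without_label_hashtag, label), or None if no known label.

    known_labels is a set of lowercase tag names (hypothetical input).
    """
    tags = [t.lower() for t in HASHTAG.findall(tweet)]
    matches = [t for t in tags if t in known_labels]
    if not matches:
        return None
    label = matches[0]  # assumption: the first matching hashtag is the label
    # Remove every occurrence of the label hashtag, ignoring case.
    pattern = r"#" + re.escape(label) + r"\b"
    text = re.sub(pattern, "", tweet, flags=re.IGNORECASE)
    return " ".join(text.split()), label


if __name__ == "__main__":
    print(split_label("Great game tonight #NBA", {"nba"}))
```

With 50 labels, as in the abstract, `known_labels` would hold 50 tag names; tweets matching none of them are skipped in this sketch.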

Taxonomy

Topics: Text and Document Classification Technologies · Topic Modeling · Spam and Phishing Detection