Feature Selection on Noisy Twitter Short Text Messages for Language Identification
Mohd Zeeshan Ansari, Tanvir Ahmad, Ana Fatima

TL;DR
This paper investigates feature selection techniques for improving language identification in noisy, short Twitter texts, focusing on Hindi-English detection using various algorithms and classifiers to enhance model efficiency.
Contribution
It introduces a comprehensive analysis of feature selection methods on Twitter data for language identification, highlighting their impact on classifier performance.
Findings
Feature selection improves language identification accuracy.
Different algorithms vary in effectiveness depending on the classifier.
Word-level features and n-grams are crucial for performance.
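As a rough illustration of what filter-style feature selection over n-gram features can look like, the sketch below scores character n-grams with a chi-square test and keeps the top-scoring ones. It is a minimal sketch under stated assumptions, not the authors' pipeline: this summary does not say which selection algorithms or classifiers the paper evaluates, so chi-square is used only as one common criterion. The function names (`char_ngrams`, `chi_square_scores`, `select_top_k`) and the toy tokens are hypothetical and invented for this example.

```python
from collections import Counter


def char_ngrams(token, n_min=1, n_max=3):
    """Return the set of character n-grams of a token (presence features)."""
    padded = f"_{token.lower()}_"  # underscores mark the word boundaries
    return {
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    }


def chi_square_scores(samples, positive_label):
    """Chi-square score of each n-gram against one label (2x2 presence table).

    samples: list of (token, label) pairs.
    """
    n = len(samples)
    n_pos = sum(1 for _, label in samples if label == positive_label)
    with_feat = Counter()      # tokens containing the feature
    with_feat_pos = Counter()  # ... that also carry the positive label
    for token, label in samples:
        feats = char_ngrams(token)
        with_feat.update(feats)
        if label == positive_label:
            with_feat_pos.update(feats)

    scores = {}
    for feat, total in with_feat.items():
        a = with_feat_pos[feat]  # feature present, positive label
        b = total - a            # feature present, other label
        c = n_pos - a            # feature absent, positive label
        d = n - n_pos - b        # feature absent, other label
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        # A feature present in every token (or none) carries no signal.
        scores[feat] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(scores, k):
    """Keep the k highest-scoring features (ties broken alphabetically)."""
    return sorted(scores, key=lambda f: (-scores[f], f))[:k]


if __name__ == "__main__":
    # Toy word-level data, invented for illustration only ("hi" = romanized Hindi).
    toy = [
        ("kya", "hi"), ("nahi", "hi"), ("accha", "hi"), ("bahut", "hi"),
        ("the", "en"), ("what", "en"), ("good", "en"), ("this", "en"),
    ]
    scores = chi_square_scores(toy, positive_label="hi")
    print(select_top_k(scores, 10))
```

The selected n-grams would then serve as the reduced feature set passed to a classifier. With real tweet data, the cut-off `k` is a tuning choice that trades model size against accuracy.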
Abstract
The task of written language identification typically involves detecting the languages present in a sample of text. Moreover, a sequence of text may not belong to a single language but may instead be a mixture of text written in multiple languages. Text of this kind is generated in large volumes on social media platforms owing to their flexible and user-friendly environment. Such text contains a very large number of features, which are essential for developing statistical, probabilistic, and other kinds of language models. This large feature set includes informative features as well as irrelevant and redundant ones, which have varying effects on the performance of the learning model. Therefore, feature selection methods are important for choosing the features that are most relevant for an efficient model. In this article, we consider Hindi-English language identification…
Taxonomy
Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Feature Selection
