A Case Study of Spanish Text Transformations for Twitter Sentiment Analysis
Eric S. Tellez, Sabino Miranda-Jiménez, Mario Graff, Daniela Moctezuma, Oscar S. Siordia, and Elio A. Villaseñor

TL;DR
This study investigates how various text transformations and tokenization strategies affect Spanish Twitter sentiment analysis accuracy, introducing a novel word and character n-gram combination that improves classifier performance.
Contribution
It systematically analyzes the impact of text preprocessing techniques and introduces a new combined n-gram approach that enhances sentiment classifier accuracy.
Findings
The combined word and character n-gram approach outperforms traditional methods by up to 11.17%.
Text transformations significantly influence classifier accuracy.
The exhaustive analysis identifies key characteristics of effective text preprocessing.
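
As a rough illustration of what combining word and character n-grams can look like, the sketch below builds one token list from both kinds of n-grams. The function names, the default sizes, and the lowercasing step are assumptions made for this example and are not taken from the paper.

```python
def word_ngrams(text, n):
    """Return the word n-grams of `text`, splitting on whitespace."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams(text, q):
    """Return the character q-grams of `text`.

    Whitespace runs are collapsed to one space, so a q-gram may span
    a word boundary.
    """
    s = " ".join(text.split())
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def combined_tokens(text, word_sizes=(1, 2), char_sizes=(3,)):
    """Concatenate word n-grams and character q-grams into one token list.

    The default sizes are illustrative assumptions, not the paper's settings.
    """
    text = text.lower()  # assumed normalization step
    tokens = []
    for n in word_sizes:
        tokens.extend(word_ngrams(text, n))
    for q in char_sizes:
        tokens.extend(char_ngrams(text, q))
    return tokens


tokens = combined_tokens("Muy bien hoy")
```

Character n-grams tolerate misspellings and slang better than whole words, which is one plausible reason a combined representation suits noisy Twitter text.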
Abstract
Sentiment analysis is a text mining task that determines the polarity of a given text, i.e., its positiveness or negativeness. Recently, it has received a lot of attention given the interest in opinion mining in micro-blogging platforms. These new forms of textual expression present new challenges for text analysis, given the use of slang and orthographic and grammatical errors, among others. Along with these challenges, a practical sentiment classifier should be able to handle large workloads efficiently. The aim of this research is to identify which text transformations (lemmatization, stemming, entity removal, among others), tokenizers (e.g., word n-grams), and token weighting schemes have the greatest impact on the accuracy of a classifier (Support Vector Machine) trained on two Spanish corpora. The methodology used is to exhaustively analyze all the combinations of the text transformations and…
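
To make the idea of exhaustively analyzing combinations concrete, here is a minimal sketch that enumerates every configuration in a small search space. The option names and values in `SEARCH_SPACE` are hypothetical placeholders chosen for illustration. The abstract names the kinds of choices involved (transformations, tokenizers, weighting schemes) but not these exact settings.

```python
from itertools import product

# Hypothetical search space: each key is one preprocessing choice and each
# list holds the values to try. These are illustrative, not the paper's grid.
SEARCH_SPACE = {
    "lemmatize": [False, True],
    "stem": [False, True],
    "remove_entities": [False, True],
    "word_ngram_sizes": [(1,), (1, 2)],
    "weighting": ["tf", "tfidf"],
}


def all_configurations(space):
    """Return one dict per combination of option values (Cartesian product)."""
    keys = list(space)
    return [
        dict(zip(keys, values))
        for values in product(*(space[k] for k in keys))
    ]


configs = all_configurations(SEARCH_SPACE)
```

Each resulting dictionary would then drive one train-and-evaluate run of the classifier. The number of runs grows multiplicatively with every added option.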
