Duluth UROP at SemEval-2018 Task 2: Multilingual Emoji Prediction with Ensemble Learning and Oversampling
Shuning Jin, Ted Pedersen

TL;DR
This paper presents a multilingual emoji prediction system that uses ensemble classifiers with oversampling to handle skewed data, placing 19th of 48 systems in English and 5th of 21 in Spanish in SemEval-2018 Task 2.
Contribution
The authors developed an ensemble learning approach with oversampling for multilingual emoji prediction, demonstrating significant performance improvements after simple preprocessing adjustments.
Findings
Ensemble classifiers improved prediction accuracy.
Oversampling helped address data imbalance.
Simple preprocessing changes made after the evaluation yielded results that would have ranked sixth in English and second in Spanish.
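The oversampling idea mentioned above can be illustrated with a minimal sketch. The exact scheme the authors used is not stated here, so this assumes plain random oversampling: minority-class examples are duplicated at random until every class matches the size of the largest one. The function name `oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def oversample(texts, labels, seed=0):
    """Randomly duplicate minority-class examples until every class
    has as many examples as the largest class (an assumed scheme)."""
    rng = random.Random(seed)

    # Group the examples by their label.
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)

    # Every class is grown to the size of the largest class.
    target = max(len(items) for items in by_label.values())

    out_texts, out_labels = [], []
    for label, items in by_label.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Toy, hypothetical data: "heart" dominates, "joy" is rare.
texts = ["love this", "so happy", "great day", "love you", "lol funny"]
labels = ["heart", "heart", "heart", "heart", "joy"]

balanced_texts, balanced_labels = oversample(texts, labels)
counts = Counter(balanced_labels)
```

After the call, both classes in the toy data have four examples each. Oversampling of this kind should be applied to the training split only, so that duplicated examples do not leak into evaluation data.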
Abstract
This paper describes the Duluth UROP systems that participated in SemEval-2018 Task 2, Multilingual Emoji Prediction. We relied on a variety of ensembles made up of classifiers using Naive Bayes, Logistic Regression, and Random Forests. We used unigram and bigram features and tried to offset the skewness of the data through the use of oversampling. Our task evaluation results place us 19th of 48 systems in the English evaluation, and 5th of 21 in the Spanish. After the evaluation we realized that some simple changes to preprocessing could significantly improve our results. After making these changes we attained results that would have placed us sixth in the English evaluation, and second in the Spanish.
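As a rough illustration of two ingredients named in the abstract, the sketch below extracts unigram and bigram counts from a tweet and combines the outputs of several classifiers by majority vote. The lowercasing, whitespace tokenization, and hard-voting rule are assumptions made for this example; the abstract does not specify how the ensembles combined their members. The function names are hypothetical.

```python
from collections import Counter


def ngram_features(text):
    """Count unigrams and bigrams in a text.

    Assumes lowercasing and whitespace tokenization, which may differ
    from the preprocessing the authors actually used.
    """
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)


def majority_vote(predictions):
    """Return the label predicted by the most classifiers.

    Ties go to the label that appears first in the list.
    """
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical tweet and hypothetical per-classifier predictions
# (e.g. from Naive Bayes, Logistic Regression, and a Random Forest).
features = ngram_features("Happy birthday my friend")
votes = ["heart", "joy", "heart"]
winner = majority_vote(votes)
```

Here `features` holds four unigrams and three bigrams, and `winner` is the label chosen by two of the three hypothetical classifiers.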
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Digital Communication and Language · Natural Language Processing Techniques
Methods: Logistic Regression
