An Empirical Evaluation of Text Representation Schemes on Multilingual Social Web to Filter the Textual Aggression

Sandip Modha; Prasenjit Majumder

arXiv:1904.08770·cs.IR·April 19, 2019·1 cites

Open Access

TL;DR

This study compares text representation schemes for user aggression detection and fact detection on multilingual social media data, finding that pre-trained word embeddings such as fastText yield the best weighted F1-score, while BoW outperforms word embeddings on machine learning classifiers.

Contribution

It provides an empirical comparison of multiple text representation techniques, including BoW, word embeddings, and transfer learning models, on multilingual social media tasks.
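As a rough illustration of the simplest scheme in that comparison, the sketch below builds a bag-of-words (BoW) count vector in plain Python. The whitespace tokenizer, the example sentences, and the helper names `build_vocabulary` and `bow_vector` are assumptions made for illustration; the paper's actual preprocessing is not described in this listing.

```python
from collections import Counter


def build_vocabulary(docs):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def bow_vector(doc, vocab):
    """Return term counts for `doc` as a dense list aligned with `vocab`."""
    counts = Counter(doc.lower().split())
    vec = [0] * len(vocab)
    for tok, n in counts.items():
        if tok in vocab:  # out-of-vocabulary tokens are simply dropped
            vec[vocab[tok]] = n
    return vec


# Hypothetical toy posts, not taken from the paper's datasets.
docs = ["you are so so wrong", "you are right"]
vocab = build_vocabulary(docs)
vectors = [bow_vector(d, vocab) for d in docs]
```

Each post becomes a fixed-length vector of word counts with no notion of word order or similarity, which is the main contrast with the dense word embeddings the study also evaluates.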

Findings

01

BoW outperforms word embeddings on machine learning classifiers.

02

Pre-trained word embeddings like fastText yield the best weighted F1-score.

03

Deep neural models are more robust on lexically different datasets.

Abstract

This paper attempts to study the effectiveness of text representation schemes on two tasks, namely user aggression detection and fact detection from social media content. In user aggression detection, the aim is to identify the level of aggression in content generated on social media and written in English, Devanagari Hindi, and Romanized Hindi. Aggression levels are categorized into three predefined classes, namely `Non-aggressive`, `Overtly Aggressive`, and `Covertly Aggressive`. During disaster-related incidents, social media platforms like Twitter are flooded with millions of posts. In such emergency situations, identifying factual posts is important for organizations involved in relief operations. We treat this problem as a combination of classification and ranking problems. This paper presents a comparison of various text representation schemes based on BoW…

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Spam and Phishing Detection · Sentiment Analysis and Opinion Mining

Methods: Dropout · GloVe Embeddings · Skip-gram Word2Vec · Adam · Sigmoid Activation · Tanh Activation · Temporal Activation Regularization · DropConnect · Long Short-Term Memory · Activation Regularization