Gauravarora@HASOC-Dravidian-CodeMix-FIRE2020: Pre-training ULMFiT on Synthetically Generated Code-Mixed Data for Hate Speech Detection
Gaurav Arora

TL;DR
This paper presents a method for hate speech detection in code-mixed Dravidian languages (Tamil-English and Malayalam-English) using ULMFiT pre-trained on synthetically generated code-mixed data. The system achieved a 0.88 weighted F1-score for Tamil-English in Sub-task B and ranked 2nd on the leaderboard.
Contribution
The main contribution is pre-training ULMFiT on synthetically generated code-mixed data, where the data generation is modelled as a Markov process using Markov chains, and applying the resulting model to hate speech detection in Dravidian languages.
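As a rough illustration of what generating text with a Markov chain can look like, here is a minimal sketch. It assumes a first-order, word-level chain fitted on a handful of made-up romanized sentences; the paper summary does not state the chain order, the state definition, or the seed data, so `fit_chain`, `sample_sentence`, and `toy_corpus` are hypothetical names chosen for this example and not the author's implementation.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"  # sentence boundary markers (assumed convention)


def fit_chain(sentences):
    """Record, for each token, every token observed immediately after it."""
    transitions = defaultdict(list)
    for sentence in sentences:
        tokens = [START] + sentence.split() + [END]
        for prev, nxt in zip(tokens, tokens[1:]):
            transitions[prev].append(nxt)
    return transitions


def sample_sentence(transitions, rng, max_len=20):
    """Walk the chain from START until END is drawn or max_len is reached."""
    token, out = START, []
    while len(out) < max_len:
        # Duplicates in the successor list make frequent transitions likelier.
        token = rng.choice(transitions[token])
        if token == END:
            break
        out.append(token)
    return " ".join(out)


# Made-up toy sentences for illustration only, not data from the task.
toy_corpus = [
    "intha padam super ah irukku",
    "padam romba nalla irukku bro",
    "intha song semma mass",
    "trailer super ah irukku bro",
]

chain = fit_chain(toy_corpus)
rng = random.Random(0)  # fixed seed so repeated runs give the same samples
synthetic = [sample_sentence(chain, rng) for _ in range(5)]
for line in synthetic:
    print(line)
```

Sentences sampled this way recombine observed word transitions, which is one way a small seed corpus could be expanded into a larger synthetic corpus for language-model pre-training.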
Findings
Achieved 0.88 weighted F1-score for Tamil-English in Sub-task B
Ranked 2nd on the leaderboard for Tamil-English in Sub-task B
Achieved 0.91 F1-score for Malayalam-English in Sub-task A
Abstract
This paper describes the system submitted to Dravidian-Codemix-HASOC2020: Hate Speech and Offensive Content Identification in Dravidian languages (Tamil-English and Malayalam-English). The task aims to identify offensive language in a code-mixed dataset of comments/posts in Dravidian languages collected from social media. We participated in both Sub-task A, which aims to identify offensive content in mixed script (a mixture of native and Roman script), and Sub-task B, which aims to identify offensive content in Roman script, for Dravidian languages. To address these tasks, we proposed pre-training ULMFiT on synthetically generated code-mixed data, generated by modelling code-mixed data generation as a Markov process using Markov chains. Our model achieved a 0.88 weighted F1-score for code-mixed Tamil-English in Sub-task B and ranked 2nd on the leaderboard. Additionally,…
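The abstract reports a weighted F1-score. For readers unfamiliar with the metric, the sketch below computes it in plain Python using the standard definition: the per-class F1 averaged with weights equal to each class's share of the true labels. The function name `weighted_f1` and the labels `OFF` and `NOT` are assumptions for this example, not taken from the task's evaluation script.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its count in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
        score += (count / total) * f1
    return score


# Hypothetical gold labels and predictions for five comments.
gold = ["OFF", "NOT", "NOT", "OFF", "NOT"]
pred = ["OFF", "NOT", "OFF", "OFF", "NOT"]
print(round(weighted_f1(gold, pred), 2))
```

Weighting by support means the majority class dominates the score, which is worth keeping in mind when a dataset is imbalanced.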
Taxonomy
Topics: Hate Speech and Cyberbullying Detection
Methods: Tanh Activation · Dropout · Sigmoid Activation · Weight Tying · Embedding Dropout · Long Short-Term Memory · Activation Regularization · Variational Dropout · Discriminative Fine-Tuning · Slanted Triangular Learning Rates
