Gauravarora@HASOC-Dravidian-CodeMix-FIRE2020: Pre-training ULMFiT on Synthetically Generated Code-Mixed Data for Hate Speech Detection

Gaurav Arora

arXiv:2010.02094 · cs.CL · October 21, 2020 · 23 cites

Open Access

TL;DR

This paper describes an offensive-content detection system for code-mixed Tamil-English and Malayalam-English that pre-trains ULMFiT on synthetically generated code-mixed data. It reports a weighted F1-score of 0.88 for Tamil-English in Sub-task B, ranking second on the leaderboard.

Contribution

The paper proposes pre-training ULMFiT on synthetic code-mixed text, where text generation is modelled as a Markov process using Markov chains, and applies the resulting model to offensive-content identification in code-mixed Tamil-English and Malayalam-English.
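As a rough illustration of what modelling text generation as a Markov process can look like, the following is a minimal first-order Markov chain sketch. It is not the author's implementation: the chain order, the `build_chain` and `generate` helpers, the boundary tokens, and the three toy sentences are all assumptions made for this example, and the toy sentences are not taken from the task dataset.

```python
import random
from collections import defaultdict

# Hypothetical sentence-boundary markers used only in this sketch.
START, END = "<s>", "</s>"

def build_chain(sentences):
    """Record first-order transitions: token -> list of observed next tokens."""
    chain = defaultdict(list)
    for sentence in sentences:
        tokens = [START] + sentence.split() + [END]
        for current, following in zip(tokens, tokens[1:]):
            chain[current].append(following)
    return chain

def generate(chain, rng, max_tokens=20):
    """Random walk from START until END is drawn or max_tokens is reached."""
    token, output = START, []
    while len(output) < max_tokens:
        # Sampling from the raw list reproduces the observed transition frequencies.
        token = rng.choice(chain[token])
        if token == END:
            break
        output.append(token)
    return " ".join(output)

# Toy placeholder corpus (illustrative romanized tokens, not real task data).
corpus = [
    "padam super da",
    "padam nalla irukku",
    "trailer super irukku",
]

chain = build_chain(corpus)
rng = random.Random(0)  # fixed seed so runs are repeatable
samples = [generate(chain, rng) for _ in range(5)]
for sample in samples:
    print(sample)
```

Each generated sentence only uses word-to-word transitions seen in the seed corpus, so the output can recombine fragments of different sentences while staying locally plausible. A language model would then be pre-trained on a large number of such samples.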

Findings

01 Achieved a weighted F1-score of 0.88 for Tamil-English in Sub-task B

02 Ranked 2nd on the leaderboard for Tamil-English in Sub-task B

03 Achieved an F1-score of 0.91 for Malayalam-English in Sub-task A
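The abstract reports its headline result as a weighted F1-score. For readers unfamiliar with the metric, here is a small sketch of how it is computed: the per-class F1 is averaged with each class weighted by its number of true examples. The `weighted_f1` helper and the `OFF`/`NOT` label names are hypothetical and chosen only for this example; the five predictions are made up and unrelated to the paper's results.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1
    return total / len(y_true)

# Made-up labels: two offensive and three non-offensive comments.
y_true = ["OFF", "OFF", "NOT", "NOT", "NOT"]
y_pred = ["OFF", "NOT", "NOT", "NOT", "OFF"]

# OFF: F1 = 0.5 (support 2); NOT: F1 = 2/3 (support 3); weighted mean = 0.6
print(round(weighted_f1(y_true, y_pred), 4))
```

Weighting by support means the majority class dominates the score, which is worth keeping in mind when comparing results on imbalanced offensive-language datasets.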

Abstract

This paper describes the system submitted to Dravidian-Codemix-HASOC2020: Hate Speech and Offensive Content Identification in Dravidian languages (Tamil-English and Malayalam-English). The task aims to identify offensive language in a code-mixed dataset of comments and posts in Dravidian languages collected from social media. We participated in both Sub-task A, which aims to identify offensive content in mixed script (a mixture of native and Roman script), and Sub-task B, which aims to identify offensive content in Roman script, for Dravidian languages. To address these tasks, we propose pre-training ULMFiT on synthetically generated code-mixed data, where code-mixed data generation is modelled as a Markov process using Markov chains. Our model achieved a weighted F1-score of 0.88 for code-mixed Tamil-English in Sub-task B and ranked 2nd on the leaderboard. Additionally,…

Taxonomy

Topics: Hate Speech and Cyberbullying Detection

Methods: Tanh Activation · Dropout · Sigmoid Activation · Weight Tying · Embedding Dropout · Long Short-Term Memory · Activation Regularization · Variational Dropout · Discriminative Fine-Tuning · Slanted Triangular Learning Rates
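Of the methods listed, slanted triangular learning rates are the easiest to show in a few lines. The sketch below follows the schedule as formulated in the original ULMFiT paper by Howard and Ruder: a short linear warm-up to a peak rate followed by a long linear decay. The default values for `lr_max`, `cut_frac`, and `ratio` are the ones suggested in that paper; whether this submission used them is not stated on this page, so treat them as assumptions. The `stlr` function name is hypothetical.

```python
import math

def stlr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at training step t under a slanted triangular schedule."""
    # Step at which the rate peaks; clamped to 1 to avoid division by zero
    # on very short runs (the clamp is an addition for this sketch).
    cut = max(1, math.floor(total_steps * cut_frac))
    if t < cut:
        p = t / cut  # linear warm-up
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # linear decay
    # ratio sets how much smaller the lowest rate is than lr_max.
    return lr_max * (1 + p * (ratio - 1)) / ratio

total_steps = 1000
for step in (0, 50, 100, 500, 1000):
    print(step, round(stlr(step, total_steps), 6))
```

With these defaults the rate starts at `lr_max / ratio`, reaches `lr_max` after the first 10% of steps, and returns to `lr_max / ratio` at the end of training.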