Multilingual Offensive Language Identification for Low-resource Languages
Tharindu Ranasinghe, Marcos Zampieri

TL;DR
This paper presents a transfer learning approach using cross-lingual embeddings to detect offensive language in low-resource languages, achieving competitive results across multiple languages.
Contribution
It introduces a novel application of cross-lingual contextual embeddings and transfer learning for offensive language detection in low-resource languages, leveraging English datasets.
Findings
Macro F1 scores are high across languages, e.g., 0.8415 for Bengali (TRAC-2) and 0.8701 for Greek (OffensEval 2020).
The approach outperforms or matches state-of-the-art systems in the shared tasks.
The results demonstrate the robustness of cross-lingual embeddings for offensive language detection.
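The scores above are macro-averaged F1, which computes F1 separately for each class and then takes the unweighted mean, so a rare offensive class counts as much as the majority class. A minimal sketch of that metric follows. The function name `macro_f1` and the toy labels `OFF`/`NOT` are illustrative assumptions, not the paper's evaluation code or data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Toy example (made-up labels, not results from the paper)
y_true = ["OFF", "OFF", "NOT", "NOT", "NOT", "NOT"]
y_pred = ["OFF", "NOT", "NOT", "NOT", "NOT", "OFF"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

In this toy case the `OFF` class scores 0.5 and the `NOT` class scores 0.75, so the macro average is 0.625 even though plain accuracy is about 0.67.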
Abstract
Offensive content is pervasive in social media and a reason for concern to companies and government organizations. Several studies have been recently published investigating methods to detect the various forms of such content (e.g. hate speech, cyberbullying, and cyberaggression). The clear majority of these studies deal with English partially because most annotated datasets available contain English data. In this paper, we take advantage of available English datasets by applying cross-lingual contextual word embeddings and transfer learning to make predictions in low-resource languages. We project predictions on comparable data in Arabic, Bengali, Danish, Greek, Hindi, Spanish, and Turkish. We report results of 0.8415 F1 macro for Bengali in TRAC-2 shared task, 0.8532 F1 macro for Danish and 0.8701 F1 macro for Greek in OffensEval 2020, 0.8568 F1 macro for Hindi in HASOC 2019 shared…
