TL;DR
This paper shows that cross-lingual contextual embeddings combined with transfer learning from English data can identify offensive language in lower-resource languages (Bengali, Hindi, and Spanish), with results that compare favorably to the best systems from recent shared tasks.
Contribution
It introduces a transfer learning approach using cross-lingual embeddings for offensive language detection in low-resource languages, showing strong results across Bengali, Hindi, and Spanish.
Findings
Achieved high F1 macro scores: 0.8415 for Bengali, 0.8568 for Hindi, 0.7513 for Spanish.
Compared favorably with the best systems submitted to recent shared tasks on offensive language detection.
Confirmed robustness of cross-lingual embeddings for multilingual offensive content identification.
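The scores above are macro-averaged F1: F1 is computed separately for each class and then averaged with equal weight, so a minority class (typically the offensive one) counts as much as the majority class. A minimal sketch of the metric in plain Python follows; the labels "OFF" and "NOT" and the toy predictions are illustrative assumptions, not data or code from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        per_class.append(f1)
    return sum(per_class) / len(per_class)


# Hypothetical labels: "OFF" = offensive, "NOT" = not offensive.
y_true = ["OFF", "OFF", "NOT", "NOT", "NOT", "NOT"]
y_pred = ["OFF", "NOT", "NOT", "NOT", "NOT", "OFF"]

# Per-class F1 is 0.75 for "NOT" and 0.5 for "OFF", so the macro average is 0.625.
print(macro_f1(y_true, y_pred))
```

In this toy example accuracy is 4/6 (about 0.667) while macro F1 is 0.625, which illustrates why macro F1 is the more conservative figure when the classes are imbalanced.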
Abstract
Offensive content is pervasive in social media and a reason for concern to companies and government organizations. Several studies have been recently published investigating methods to detect the various forms of such content (e.g., hate speech, cyberbullying, and cyberaggression). The clear majority of these studies deal with English, partially because most annotated datasets available contain English data. In this paper, we take advantage of the English data available by applying cross-lingual contextual word embeddings and transfer learning to make predictions in languages with fewer resources. We project predictions on comparable data in Bengali, Hindi, and Spanish, and we report results of 0.8415 F1 macro for Bengali, 0.8568 F1 macro for Hindi, and 0.7513 F1 macro for Spanish. Finally, we show that our approach compares favorably to the best systems submitted to recent shared tasks on these…
