UPB at SemEval-2020 Task 12: Multilingual Offensive Language Detection on Social Media by Fine-tuning a Variety of BERT-based Models

Mircea-Adrian Tanase; Dumitru-Clementin Cercel; Costin-Gabriel Chiru

arXiv:2010.13609 · cs.CL · October 28, 2020

Open Access

TL;DR

This paper explores multilingual offensive language detection on social media using fine-tuned BERT-based models across five languages, achieving competitive results in the SemEval-2020 shared task.

Contribution

It evaluates multiple transformer architectures and training strategies for multilingual offensive language detection, providing insights into their comparative effectiveness.

Findings

01

The best submission ranked 10th of 46 (Turkish); rankings for the other four languages ranged from 16th to 28th.

02

Multilingual models performed competitively with single-language models.

03

Fine-tuning on combined datasets improved detection accuracy.

Abstract

Offensive language detection is one of the most challenging problems in natural language processing, made pressing by the rising presence of this phenomenon in online social media. This paper describes our Transformer-based solutions for identifying offensive language on Twitter in five languages (i.e., English, Arabic, Danish, Greek, and Turkish), which were employed in Subtask A of the OffensEval 2020 shared task. Several neural architectures (i.e., BERT, mBERT, RoBERTa, XLM-RoBERTa, and ALBERT), pre-trained on both single-language and multilingual corpora, were fine-tuned and compared using multiple combinations of datasets. Finally, the highest-scoring models were used for our submissions in the competition, which ranked our team 21st of 85, 28th of 53, 19th of 39, 16th of 37, and 10th of 46 for English, Arabic, Danish, Greek, and Turkish, respectively.

Taxonomy

Topics: Hate Speech and Cyberbullying Detection

Methods: Linear Layer · mBERT · Layer Normalization · Adam · Dense Connections · Dropout · Linear Warmup With Linear Decay · Attention Dropout · Weight Decay
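
The "Linear Warmup With Linear Decay" entry above refers to a learning-rate schedule commonly used when fine-tuning BERT-style models: the rate rises linearly from zero to a peak, then falls linearly back to zero. The following is a minimal sketch of that schedule only. The function name, the peak rate, and the step counts are illustrative assumptions and are not taken from the paper.

```python
def linear_warmup_linear_decay(step, peak_lr, warmup_steps, total_steps):
    """Return the learning rate for a 0-indexed optimizer step.

    All argument values are hypothetical; the paper's actual
    hyperparameters are not given on this page.
    """
    if total_steps <= 0 or not 0 <= warmup_steps <= total_steps:
        raise ValueError("need 0 <= warmup_steps <= total_steps and total_steps > 0")
    if step < warmup_steps:
        # Warmup phase: rate grows linearly from 0 toward peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: rate shrinks linearly from peak_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    decay_span = max(1, total_steps - warmup_steps)
    return peak_lr * remaining / decay_span


if __name__ == "__main__":
    # Assumed example values: peak rate 2e-5, 100 warmup steps, 1000 steps total.
    for s in (0, 50, 100, 550, 1000):
        print(s, linear_warmup_linear_decay(s, 2e-5, 100, 1000))
```

With these assumed values the rate reaches its peak at step 100 and returns to zero at step 1000. In practice a schedule like this is applied to each optimizer step during fine-tuning.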