Investigating the Impact of Semi-Supervised Methods with Data Augmentation on Offensive Language Detection in Romanian Language

Elena-Beatrice Nicola; Dumitru-Clementin Cercel; Florin Pop

arXiv:2407.20076·cs.CL·July 31, 2024

Open Access

TL;DR

This paper evaluates semi-supervised learning combined with data augmentation for offensive language detection in Romanian and finds that some methods benefit more from augmentation than others.

Contribution

It implements and compares eight semi-supervised methods for Romanian offensive language detection on the RO-Offense dataset, both on the original data alone and with five data augmentation techniques applied beforehand.

Findings

01

Some semi-supervised methods benefit more from data augmentation.

02

Augmentation techniques improve model robustness.

03

Certain methods outperform others with augmentation.

Abstract

Offensive language detection is a crucial task in today's digital landscape, where online platforms grapple with maintaining a respectful and inclusive environment. However, building robust offensive language detection models requires large amounts of labeled data, which can be expensive and time-consuming to obtain. Semi-supervised learning offers a feasible solution by utilizing labeled and unlabeled data to create more accurate and robust models. In this paper, we explore a few different semi-supervised methods, as well as data augmentation techniques. Concretely, we implemented eight semi-supervised methods and ran experiments for them using only the available data in the RO-Offense dataset and applying five augmentation techniques before feeding the data to the models. Experimental results demonstrate that some of them benefit more from augmentations than others.
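The abstract does not name the eight semi-supervised methods or the five augmentation techniques, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes pseudo-labeling (self-training) as the semi-supervised method and random token deletion as the augmentation. The tiny naive Bayes classifier, the toy English sentences, the label names `offensive` and `other`, and all function names and thresholds are hypothetical stand-ins for illustration.

```python
import math
import random
from collections import Counter


def tokenize(text):
    return text.lower().split()


def random_deletion(text, p=0.2, rng=random):
    """Assumed augmentation: drop each token with probability p, keeping at least one."""
    tokens = tokenize(text)
    kept = [t for t in tokens if rng.random() > p]
    return " ".join(kept or tokens[:1])


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing (stand-in classifier)."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.docs = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.counts[label].update(tokenize(text))
        self.vocab = {t for c in self.classes for t in self.counts[c]}
        return self

    def predict_proba(self, text):
        total_docs = sum(self.docs.values())
        log_scores = {}
        for c in self.classes:
            denom = sum(self.counts[c].values()) + len(self.vocab)
            score = math.log(self.docs[c] / total_docs)
            for t in tokenize(text):
                if t in self.vocab:  # tokens never seen in training are ignored
                    score += math.log((self.counts[c][t] + 1) / denom)
            log_scores[c] = score
        top = max(log_scores.values())
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        z = sum(exp.values())
        return {c: v / z for c, v in exp.items()}


def self_train(labeled, unlabeled, threshold=0.75, rounds=3, augment=None):
    """Pseudo-labeling: repeatedly add confidently predicted unlabeled texts."""
    texts = [t for t, _ in labeled]
    labels = [y for _, y in labeled]
    if augment is not None:
        # Augment only the labeled data, before training, and keep the originals.
        texts += [augment(t) for t, _ in labeled]
        labels += [y for _, y in labeled]
    pool = list(unlabeled)
    model = NaiveBayes().fit(texts, labels)
    for _ in range(rounds):
        remaining, added = [], False
        for text in pool:
            proba = model.predict_proba(text)
            label = max(proba, key=proba.get)
            if proba[label] >= threshold:
                texts.append(text)
                labels.append(label)
                added = True
            else:
                remaining.append(text)
        pool = remaining
        if not added:
            break  # nothing confident enough; stop early
        model = NaiveBayes().fit(texts, labels)
    return model


# Toy data, purely illustrative (not taken from RO-Offense).
labeled = [
    ("you are an idiot", "offensive"),
    ("what a stupid idiot", "offensive"),
    ("have a nice day", "other"),
    ("thank you very much", "other"),
]
unlabeled = ["such an idiot", "nice day today", "very stupid"]

rng = random.Random(0)
model = self_train(labeled, unlabeled, augment=lambda t: random_deletion(t, 0.2, rng))
```

Running `self_train` once with `augment=None` and once with an augmentation function, then scoring both models on a held-out set, mirrors the kind of with-and-without comparison the abstract describes.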

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Interpreting and Communication in Healthcare · Text Readability and Simplification