A Large-scale Dataset for Hate Speech Detection on Vietnamese Social Media Texts

Son T. Luu; Kiet Van Nguyen; Ngan Luu-Thuy Nguyen

arXiv:2103.11528·cs.CL·July 21, 2021

TL;DR

This paper introduces ViHSD, a large annotated dataset of Vietnamese social media comments labeled for hate speech, enabling improved automatic detection using deep learning and transformer models.

Contribution

The paper presents a new large-scale, human-annotated Vietnamese hate speech dataset and details its creation and evaluation process.

Findings

1. Deep learning models achieved high accuracy on the dataset.
2. Transformer models outperformed traditional methods.
3. The dataset facilitates future hate speech detection research in Vietnamese.

Abstract

In recent years, Vietnam has seen rapid growth in the number of social network users on platforms such as Facebook, YouTube, Instagram, and TikTok. On social media, hate speech has become a critical problem for these users. To address it, we introduce ViHSD, a human-annotated dataset for automatically detecting hate speech on social networks. The dataset contains over 30,000 comments, each carrying one of three labels: CLEAN, OFFENSIVE, or HATE. We also describe the data creation process used to annotate the dataset and evaluate its quality. Finally, we evaluate the dataset with deep learning models and transformer models.
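
The three-label scheme and the idea of checking annotation quality can be illustrated with a minimal sketch. It is not the authors' code. The integer ids, the helper names `label_distribution` and `cohen_kappa`, and the six sample annotations are all assumptions made for illustration. Cohen's kappa is shown as one common inter-annotator agreement measure; the abstract does not state which measure was used.

```python
from collections import Counter
from enum import IntEnum


class Label(IntEnum):
    """The three ViHSD labels; the integer ids are an assumption."""
    CLEAN = 0
    OFFENSIVE = 1
    HATE = 2


def label_distribution(labels):
    """Return the fraction of comments carrying each label."""
    counts = Counter(labels)
    total = len(labels)
    return {lab.name: counts.get(lab, 0) / total for lab in Label}


def cohen_kappa(a, b):
    """Cohen's kappa between two annotators' label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    # Share of comments on which both annotators chose the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[lab] * count_b[lab] for lab in Label) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations for six comments (not taken from ViHSD).
annotator_1 = [Label.CLEAN, Label.CLEAN, Label.OFFENSIVE,
               Label.HATE, Label.CLEAN, Label.HATE]
annotator_2 = [Label.CLEAN, Label.CLEAN, Label.OFFENSIVE,
               Label.OFFENSIVE, Label.CLEAN, Label.HATE]

distribution = label_distribution(annotator_1)
kappa = cohen_kappa(annotator_1, annotator_2)
```

A kappa near 1 indicates agreement well above chance, while a value near 0 indicates agreement no better than chance. Reporting the label distribution alongside it helps reveal class imbalance, which matters when choosing evaluation metrics for a classifier.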

Code & Models

Models

Hugging Face: nd-khoa/vihsd-uit-visobert-v2 (model, 37 downloads)
