KoMultiText: Large-Scale Korean Text Dataset for Classifying Biased Speech in Real-World Online Services

Dasol Choi; Jooyoung Song; Eunsun Lee; Jinwoo Seo; Heejune Park; Dongbin Na

arXiv:2310.04313·cs.CL·November 14, 2023·1 cites



TL;DR

KoMultiText is a large-scale Korean dataset with multi-label annotations for bias, profanity, and preferences, enabling advanced multi-task classification to detect harmful online speech and improve community health.

Contribution

The paper introduces KoMultiText, a comprehensive Korean dataset with multi-task annotations, and demonstrates state-of-the-art BERT-based models surpassing human accuracy in bias detection.

Findings

01. BERT-based models exceed human-level accuracy on the classification tasks.

02. KoMultiText enables multi-task learning for bias, profanity, and preference detection.

03. The dataset is publicly available for research and practical applications.

Abstract

With the growth of online services, the need for advanced text classification algorithms, such as sentiment analysis and biased text detection, has become increasingly evident. The anonymous nature of online services often leads to the presence of biased and harmful language, posing challenges to maintaining the health of online communities. This phenomenon is especially relevant in South Korea, where large-scale hate speech detection algorithms have not yet been broadly explored. In this paper, we introduce "KoMultiText", a new comprehensive, large-scale dataset collected from a well-known South Korean SNS platform. Our proposed dataset provides annotations including (1) Preferences, (2) Profanities, and (3) Nine types of Bias for the text samples, enabling multi-task learning for simultaneous classification of user-generated texts. Leveraging state-of-the-art BERT-based language…
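The multi-task setup described in the abstract (one preference label, one profanity label, and nine bias labels per text) can be illustrated with a small loss computation. This is a minimal sketch under stated assumptions, not the authors' implementation: the number of preference classes (`NUM_PREFERENCE_CLASSES`), the equal weighting of the three tasks, the label dictionary layout, and the function names are all hypothetical. Only the count of nine bias types comes from the abstract.

```python
import math

NUM_BIAS_TYPES = 9            # from the abstract: nine types of bias
NUM_PREFERENCE_CLASSES = 3    # ASSUMPTION: hypothetical number of preference classes


def binary_cross_entropy_with_logit(logit, target):
    """Numerically stable binary cross-entropy computed from a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def softmax_cross_entropy(logits, target_index):
    """Cross-entropy for a single multi-class prediction (log-sum-exp form)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


def multi_task_loss(preference_logits, profanity_logit, bias_logits, labels):
    """Sum of three task losses for one text sample.

    ASSUMPTION: tasks are weighted equally, and the bias loss is averaged
    over the nine independent (multi-label) bias types.
    """
    if len(bias_logits) != NUM_BIAS_TYPES or len(labels["bias"]) != NUM_BIAS_TYPES:
        raise ValueError("expected one logit and one label per bias type")
    preference = softmax_cross_entropy(preference_logits, labels["preference"])
    profanity = binary_cross_entropy_with_logit(profanity_logit, labels["profanity"])
    bias = sum(
        binary_cross_entropy_with_logit(z, t)
        for z, t in zip(bias_logits, labels["bias"])
    ) / NUM_BIAS_TYPES
    return preference + profanity + bias


# Hypothetical sample: preference class 2, profane, and the first bias type present.
labels = {"preference": 2, "profanity": 1, "bias": [1] + [0] * 8}
uninformed = multi_task_loss([0.0] * NUM_PREFERENCE_CLASSES, 0.0, [0.0] * 9, labels)
confident = multi_task_loss([-2.0, -2.0, 4.0], 5.0, [5.0] + [-5.0] * 8, labels)
print(uninformed > confident)
```

In a real model the logits would come from task-specific heads on top of a shared encoder; here they are plain lists so the loss arithmetic can be checked in isolation.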


Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Sentiment Analysis and Opinion Mining · Social Media and Politics