KoMultiText: Large-Scale Korean Text Dataset for Classifying Biased Speech in Real-World Online Services
Dasol Choi, Jooyoung Song, Eunsun Lee, Jinwoo Seo, Heejune Park, Dongbin Na

TL;DR
KoMultiText is a large-scale Korean dataset with multi-label annotations for bias, profanity, and preferences, enabling advanced multi-task classification to detect harmful online speech and improve community health.
Contribution
The paper introduces KoMultiText, a comprehensive Korean dataset with multi-task annotations, and demonstrates state-of-the-art BERT-based models surpassing human accuracy in bias detection.
Findings
BERT-based models exceed human-level accuracy in classification tasks.
KoMultiText enables multi-task learning for bias, profanity, and preference detection.
The dataset is publicly available for research and practical applications.
Abstract
With the growth of online services, the need for advanced text classification algorithms, such as sentiment analysis and biased text detection, has become increasingly evident. The anonymous nature of online services often leads to the presence of biased and harmful language, posing challenges to maintaining the health of online communities. This phenomenon is especially relevant in South Korea, where large-scale hate speech detection algorithms have not yet been broadly explored. In this paper, we introduce "KoMultiText", a new comprehensive, large-scale dataset collected from a well-known South Korean SNS platform. Our proposed dataset provides annotations including (1) Preferences, (2) Profanities, and (3) Nine types of Bias for the text samples, enabling multi-task learning for simultaneous classification of user-generated texts. Leveraging state-of-the-art BERT-based language…
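The multi-task setup described in the abstract (one preference label, one profanity label, and nine bias labels per text) can be illustrated with a small loss computation. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Sample` schema, the binary encoding of preference and profanity, the equal task weights, and the function names are all hypothetical, and the nine bias categories are left unnamed because the abstract does not list them. In practice the logits would come from task-specific heads on top of a shared BERT-style encoder.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple

NUM_BIAS_TYPES = 9  # from the abstract; the category names are not given there


@dataclass
class Sample:
    """Hypothetical record layout for one annotated text."""
    text: str
    preference: int   # assumed binary here; the real label scale may differ
    profanity: int    # assumed binary (1 = contains profanity)
    bias: List[int]   # multi-hot vector with one slot per bias type


def bce_with_logit(logit: float, target: int) -> float:
    """Numerically stable binary cross-entropy for a single logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def multitask_loss(
    pref_logit: float,
    prof_logit: float,
    bias_logits: Sequence[float],
    sample: Sample,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),  # assumed equal weighting
) -> float:
    """Weighted sum of the three per-task losses for one sample."""
    if len(bias_logits) != NUM_BIAS_TYPES or len(sample.bias) != NUM_BIAS_TYPES:
        raise ValueError("expected one logit and one label per bias type")
    pref_loss = bce_with_logit(pref_logit, sample.preference)
    prof_loss = bce_with_logit(prof_logit, sample.profanity)
    # Average over the nine bias labels so this task is on the same scale
    # as the two single-label tasks.
    bias_loss = sum(
        bce_with_logit(z, t) for z, t in zip(bias_logits, sample.bias)
    ) / NUM_BIAS_TYPES
    w_pref, w_prof, w_bias = weights
    return w_pref * pref_loss + w_prof * prof_loss + w_bias * bias_loss


# Usage: an uninformative model (all logits zero) versus a confident, correct one.
example = Sample(text="(placeholder text)", preference=0, profanity=1,
                 bias=[1, 0, 0, 0, 0, 0, 0, 0, 0])
uninformed = multitask_loss(0.0, 0.0, [0.0] * NUM_BIAS_TYPES, example)
confident = multitask_loss(-5.0, 5.0, [5.0] + [-5.0] * 8, example)
```

Averaging the bias term before weighting is one design choice among several; summing the nine losses instead would give bias detection proportionally more influence on the shared encoder.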
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Sentiment Analysis and Opinion Mining · Social Media and Politics
