Detecting Toxicity in News Articles: Application to Bulgarian

Yoan Dinkov; Ivan Koychev; Preslav Nakov

arXiv:1908.09785·cs.CL·August 27, 2019

TL;DR

This paper presents a toxicity detector for Bulgarian news articles that combines representations from ELMo, BERT, and XLM with domain-specific features, reaching 59.0% accuracy and 39.7% macro-F1, an improvement over the baseline, despite a small dataset.

Contribution

The paper introduces a new dataset of Bulgarian news articles labeled for toxicity and a classifier for Bulgarian that combines several models, one per feature type.

Findings

1. Achieved 59.0% accuracy and 39.7% macro-F1 score.
2. Created a new dataset with 8 toxicity categories.
3. Demonstrated the effectiveness of combining multiple feature-based models.
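
The gap between the reported accuracy and macro-F1 is typical of multi-class problems with uneven class sizes. The sketch below, in plain Python, shows how the two metrics are computed and how they can diverge. The labels are made up for illustration and are not taken from the paper's dataset; `accuracy` and `macro_f1` are hypothetical helper names.

```python
def accuracy(gold, pred):
    """Fraction of examples whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every class seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (assumption, not real data): one large class and two small ones.
gold = ["non-toxic"] * 6 + ["toxic-a"] * 2 + ["toxic-b"] * 2
# A degenerate classifier that always predicts the largest class.
pred = ["non-toxic"] * 10

acc = accuracy(gold, pred)
f1 = macro_f1(gold, pred)
print(f"accuracy={acc:.2f} macro-F1={f1:.2f}")
```

Because macro-F1 weights every class equally, a model that ignores the small classes scores well on accuracy but poorly on macro-F1, which is why both numbers are worth reporting.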

Abstract

Online media aim to reach ever bigger audiences and to attract ever longer attention spans. This competition creates an environment that rewards sensational, fake, and toxic news. To help limit their spread and impact, we propose and develop a news toxicity detector that can recognize various types of toxic content. While previous research primarily focused on English, here we target Bulgarian. We created a new dataset by crawling a website that has been collecting Bulgarian news articles for five years and manually categorizing them into eight toxicity groups. Then we trained a multi-class classifier with nine categories: eight toxic and one non-toxic. We experimented with different representations based on ELMo, BERT, and XLM, as well as with a variety of domain-specific features. Due to the small size of our dataset, we created a separate model for each feature type, and we…
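
The idea of training one model per feature type and then combining their outputs can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract is truncated before it describes the combination step, so probability averaging is used here purely as a stand-in, and a nearest-centroid classifier replaces whatever per-feature models the paper actually trains. The feature names (`embedding`, `style`), the two-label toy data, and all function names are hypothetical.

```python
import math


def train_centroids(vectors, labels):
    """Fit a nearest-centroid model: one mean vector per class."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict_proba(centroids, vec):
    """Softmax over negative squared distances to each class centroid."""
    scores = {
        label: -sum((a - b) ** 2 for a, b in zip(vec, centre))
        for label, centre in centroids.items()
    }
    top = max(scores.values())
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def ensemble_predict(models, features):
    """Average the class probabilities of the per-feature-type models (assumed combiner)."""
    combined = {}
    for name, centroids in models.items():
        for label, p in predict_proba(centroids, features[name]).items():
            combined[label] = combined.get(label, 0.0) + p / len(models)
    return max(combined, key=combined.get)


# Hypothetical toy data: each article has two feature types.
train = [
    ({"embedding": [0.9, 0.1], "style": [0.2]}, "non-toxic"),
    ({"embedding": [0.8, 0.2], "style": [0.1]}, "non-toxic"),
    ({"embedding": [0.1, 0.9], "style": [0.9]}, "toxic"),
    ({"embedding": [0.2, 0.8], "style": [0.8]}, "toxic"),
]
labels = [label for _, label in train]

# One small model per feature type, each trained only on its own features.
models = {
    name: train_centroids([feats[name] for feats, _ in train], labels)
    for name in ("embedding", "style")
}

prediction = ensemble_predict(models, {"embedding": [0.15, 0.85], "style": [0.7]})
print(prediction)
```

Keeping each model restricted to a single feature type limits the number of parameters any one model has to learn, which is one plausible motivation when the dataset is small.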

Code & Models

Repositories

yoandinkov/ranlp-2019 (official)

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Advanced Malware Detection Techniques · Software Engineering Research

Methods: Linear Layer · Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Bidirectional LSTM · ELMo · Weight Decay · Residual Connection · Adam · Byte Pair Encoding