Initial Study into Application of Feature Density and Linguistically-backed Embedding to Improve Machine Learning-based Cyberbullying Detection
Juuso Eronen, Michal Ptaszynski, Fumito Masui, Gniewosz Leliwa, Michal Wroczynski, Mateusz Piech, Aleksander Smywinski-Pohl

TL;DR
This study explores how linguistic preprocessing and feature density influence machine learning performance in cyberbullying detection, introducing linguistically-backed embeddings for CNNs and confirming the predictive value of feature density.
Contribution
It introduces a new approach to training linguistically-backed embeddings for CNNs and demonstrates a correlation between feature density and classifier performance.
Findings
Neural networks effectively detect cyberbullying.
Feature density correlates with classifier performance.
Linguistically-backed embeddings improve CNN accuracy.
Abstract
In this research, we study how the performance of machine learning (ML) classifiers changes when various linguistic preprocessing methods are applied to a dataset, with a specific focus on linguistically-backed embeddings in Convolutional Neural Networks (CNNs). Moreover, we study the concept of Feature Density and confirm its potential to comparatively predict the performance of ML classifiers, including CNNs. The research was conducted on a Formspring dataset provided in a Kaggle competition on automatic cyberbullying detection. The dataset was re-annotated by objective experts (psychologists), as the importance of professional annotation in cyberbullying research has been indicated multiple times. The study confirmed the effectiveness of Neural Networks in cyberbullying detection and the correlation between classifier performance and Feature Density while also proposing a new…
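The abstract does not define Feature Density, so the following is only a minimal sketch under an assumed definition: the number of unique features divided by the total number of feature occurrences in the dataset (a type-token-style ratio). The function name `feature_density` and the toy token lists are hypothetical and not taken from the paper. The sketch illustrates why linguistic preprocessing, such as lemmatization, would be expected to change this value.

```python
from collections import Counter


def feature_density(documents):
    """Unique features divided by total feature occurrences.

    ASSUMPTION: this ratio is an illustrative definition; the abstract
    itself does not state how Feature Density is computed.
    `documents` is an iterable of token lists.
    """
    counts = Counter(token for doc in documents for token in doc)
    total = sum(counts.values())
    # Guard against an empty dataset to avoid division by zero.
    return len(counts) / total if total else 0.0


# Hypothetical toy data: the same two sentences as raw tokens and as lemmas.
raw = [["you", "are", "so", "stupid"], ["you", "were", "being", "stupid"]]
lemmas = [["you", "be", "so", "stupid"], ["you", "be", "be", "stupid"]]

fd_raw = feature_density(raw)        # 6 unique tokens over 8 occurrences
fd_lemmas = feature_density(lemmas)  # 4 unique lemmas over 8 occurrences

print(f"raw tokens: {fd_raw:.2f}, lemmas: {fd_lemmas:.2f}")
```

Under this assumed definition, lemmatization merges inflected forms into one feature and so lowers the density of the toy dataset. Comparing such values across preprocessing variants is the kind of measurement the abstract relates to classifier performance.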
Taxonomy
Topics: Hate Speech and Cyberbullying Detection
