Data Science Kitchen at GermEval 2021: A Fine Selection of Hand-Picked Features, Delivered Fresh from the Oven
Niclas Hildebrandt, Benedikt Boenninghoff, Dennis Orth, and Christopher Schymura

TL;DR
This paper describes a feature-engineering approach using semantic and style embeddings combined with numerical features, applied to classify toxic, engaging, and fact-claiming comments in a shared task, achieving competitive F1-scores.
Contribution
It introduces a novel combination of semantic, style, and numerical features with ensemble classifiers for comment classification tasks.
Findings
Achieved macro F1-scores of 66.8%, 69.9%, and 72.5% for the three subtasks.
Demonstrated effectiveness of combining deep neural embeddings with handcrafted features.
Showcased the utility of ensemble voting in multi-label comment classification.
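For readers unfamiliar with the metric, the sketch below shows one common way to compute a macro-averaged F1-score: the unweighted mean of the per-class F1-scores. This is an illustration under that assumption, not the shared task's official scorer, which may define the macro average differently. The function names `f1_for_class` and `macro_f1` and the toy labels are hypothetical.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], cls: int) -> float:
    """F1-score when `cls` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1-scores (assumed definition)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy binary example (made-up labels, not data from the paper).
toy_true = [1, 1, 0, 0]
toy_pred = [1, 0, 0, 0]
toy_score = macro_f1(toy_true, toy_pred)
```

Because every class contributes equally to the average, a classifier that ignores the rarer class is penalized more heavily than it would be under plain accuracy.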
Abstract
This paper presents the contribution of the Data Science Kitchen to the GermEval 2021 shared task on the identification of toxic, engaging, and fact-claiming comments. The task aims to extend the identification of offensive language by including additional subtasks that identify comments which should be prioritized for fact-checking by moderators and community managers. Our contribution focuses on a feature-engineering approach with a conventional classification backend. We combine semantic and writing style embeddings derived from pre-trained deep neural networks with additional numerical features, specifically designed for this task. Classifier ensembles are used to derive predictions for each subtask via a majority voting scheme. Our best submission achieved macro-averaged F1-scores of 66.8%, 69.9%, and 72.5% for the identification of toxic, engaging, and fact-claiming comments, respectively.
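The majority voting scheme mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `majority_vote`, the toy predictions, and the tie-breaking rule (the smallest label wins) are all assumptions, since the abstract does not specify them.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine label predictions from several classifiers.

    `predictions[k][i]` is the label classifier k assigns to comment i.
    Ties are broken in favor of the smallest label (an assumption).
    """
    if not predictions:
        raise ValueError("need at least one classifier")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all classifiers must predict the same number of samples")

    voted = []
    for votes in zip(*predictions):  # votes for one comment across classifiers
        counts = Counter(votes)
        top = max(counts.values())
        winners = [label for label, count in counts.items() if count == top]
        voted.append(min(winners))
    return voted


# Three hypothetical classifiers voting on three comments for one subtask.
ensemble_output = majority_vote([
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 0],
])
```

With an odd number of binary classifiers no ties can occur, so the tie-breaking rule only matters for even-sized ensembles.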
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning · Software Engineering Research
Methods: Logistic Regression
