Towards Efficient and Explainable Hate Speech Detection via Model Distillation
Paloma Piot, Javier Parapar

TL;DR
This paper introduces a method to distill large language models into smaller, efficient models that can accurately classify and explain hate speech, making detection more accessible and interpretable.
Contribution
The paper presents a novel distillation approach using Chain-of-Thought to produce smaller models that maintain explanation quality and improve classification performance.
Findings
Distilled models match large models in explanation quality.
Distilled models outperform large models in classification accuracy.
Smaller models are more suitable for operational deployment.
Abstract
Automatic detection of hate and abusive language is essential to combat its online spread. Moreover, recognising and explaining hate speech serves to educate people about its negative effects. However, most current detection models operate as black boxes, lacking interpretability and explainability. In this context, Large Language Models (LLMs) have proven effective for hate speech detection and for promoting interpretability. Nevertheless, they are computationally costly to run. In this work, we propose distilling large language models by using Chain-of-Thought to extract explanations that support the hate speech classification task. Having small language models for these tasks will contribute to their use in operational settings. In this paper, we demonstrate that distilled models deliver explanations of the same quality as larger models while surpassing them in classification performance.…
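The abstract describes using Chain-of-Thought output from a large model as training signal for a smaller one. The following is a minimal sketch of how such training pairs might be assembled, not the authors' implementation. The prompt wording, the `Label:` output convention, and all names (`COT_TEMPLATE`, `DistillationExample`, `build_prompt`, `parse_teacher_output`, `make_example`) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical prompt template; the paper's actual prompt is not given here.
COT_TEMPLATE = (
    "Decide whether the following post is hate speech. "
    "Think step by step, then answer.\n"
    "Post: {post}\n"
    "Explanation:"
)


@dataclass
class DistillationExample:
    prompt: str  # input shown to the student model
    target: str  # text the student is trained to generate


def build_prompt(post: str) -> str:
    """Fill the (assumed) Chain-of-Thought template with one post."""
    return COT_TEMPLATE.format(post=post.strip())


def parse_teacher_output(raw: str) -> Tuple[str, str]:
    """Split a teacher response of the assumed form '<explanation>\\nLabel: <label>'."""
    explanation, sep, label = raw.rpartition("Label:")
    if not sep:
        raise ValueError("teacher output has no 'Label:' line")
    return explanation.strip(), label.strip().lower()


def make_example(post: str, teacher_raw: str) -> DistillationExample:
    """Pair the prompt with the teacher's explanation and label as the student target."""
    explanation, label = parse_teacher_output(teacher_raw)
    target = f"{explanation}\nLabel: {label}"
    return DistillationExample(prompt=build_prompt(post), target=target)
```

Under these assumptions, the student is trained to generate both the explanation and the label, so a single fine-tuning pass covers classification and explanation together. Teacher responses that lack a label line raise an error rather than entering the training set silently.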
Taxonomy
Topics: Hate Speech and Cyberbullying Detection
