4chan & 8chan embeddings
Pierre Voué, Tom De Smedt, Guy De Pauw

TL;DR
This paper presents word embeddings trained on over 30M messages from the /pol/ boards of 4chan and 8chan, intended for studying toxic language and supporting hate speech detection.
Contribution
It collects over 30 million messages from the publicly available /pol/ boards and releases the word embeddings trained on them (0.4GB) for free, for research on toxic discourse.
Findings
Embeddings capture toxic language patterns
Embeddings may help boost hate speech detection systems
Resource available for further research
Abstract
We have collected over 30M messages from the publicly available /pol/ message boards on 4chan and 8chan, and compiled them into a model of toxic language use. The trained word embeddings (0.4GB) are released for free and may be useful for further study on toxic discourse or to boost hate speech detection systems: https://textgain.com/8chan.
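As a rough illustration of how word embeddings like these are typically queried, the sketch below ranks words by cosine similarity to a seed word, one common way to expand a lexicon for a detection system. It is not the authors' method. The vectors in TOY_VECTORS, the placeholder words, and the helper names cosine and nearest are all hypothetical, and the sketch does not assume anything about the file format of the released embeddings.

```python
import math

# Made-up 3-dimensional vectors for illustration only.
# They are NOT taken from the released embeddings.
TOY_VECTORS = {
    "term_a": [0.9, 0.1, 0.0],
    "term_b": [0.8, 0.2, 0.1],
    "term_c": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction
    return dot / (norm_u * norm_v)


def nearest(word, vectors, k=2):
    """Return the k words most similar to `word`, best match first."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    for other, score in nearest("term_a", TOY_VECTORS):
        print(f"{other}\t{score:.3f}")
```

With real embeddings, the dictionary would hold one vector per vocabulary word, and the neighbours of a known toxic term would be candidates for manual review rather than automatic labels.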
