4chan & 8chan embeddings

Pierre Voué; Tom De Smedt; Guy De Pauw

arXiv:2005.06946 · cs.CL · May 15, 2020 · 6 cites

Open Access

TL;DR

This paper presents word embeddings trained on over 30 million messages from the /pol/ boards of 4chan and 8chan, intended to support the study of toxic language and to help hate speech detection systems.

Contribution

It compiles over 30 million messages from the publicly available /pol/ boards into a model of toxic language use, and releases the trained word embeddings (0.4GB) for free for research on toxic discourse.

Findings

01

Embeddings capture toxic language patterns

02

Embeddings may help boost hate speech detection systems

03

Trained embeddings (0.4GB) are freely available for further research

Abstract

We have collected over 30M messages from the publicly available /pol/ message boards on 4chan and 8chan, and compiled them into a model of toxic language use. The trained word embeddings (0.4GB) are released for free and may be useful for further study on toxic discourse or to boost hate speech detection systems: https://textgain.com/8chan.
