Discovering and Categorising Language Biases in Reddit
Xavier Ferrer, Tom van Nuenen, Jose M. Such, Natalia Criado

TL;DR
This paper introduces a data-driven method that uses word embeddings to automatically discover and categorise language biases in Reddit communities, addressing a limitation of previous approaches, which rely on predefined bias sets.
Contribution
The study presents an approach that detects community-specific biases in Reddit data without predefined bias categories, making it suitable for smaller, slang-rich datasets.
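To make the idea of discovering biased words from embeddings concrete, here is a minimal sketch. This page does not spell out the scoring function, so the sketch assumes one common formulation: a word's bias is the difference between its cosine similarity to the centroid of one target set (for example, female terms) and its similarity to the centroid of another (male terms). The toy 3-dimensional vectors, the word lists, and the helper names `bias_score` and `most_biased_words` are all hypothetical and chosen only for illustration. A real analysis would load embeddings trained on a subreddit corpus instead.

```python
import numpy as np

# Toy 3-d embeddings, invented purely for illustration (not taken from the paper).
EMBEDDINGS = {
    "she":      np.array([1.0, 0.1, 0.0]),
    "her":      np.array([0.9, 0.2, 0.0]),
    "he":       np.array([0.1, 1.0, 0.0]),
    "his":      np.array([0.2, 0.9, 0.0]),
    "nurse":    np.array([0.8, 0.2, 0.5]),
    "engineer": np.array([0.2, 0.8, 0.5]),
    "table":    np.array([0.5, 0.5, 1.0]),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def centroid(words, embeddings):
    """Mean vector of a set of words."""
    return np.mean([embeddings[w] for w in words], axis=0)


def bias_score(word, target_a, target_b, embeddings):
    """Positive if `word` is closer to target set A, negative if closer to B.

    Assumed formulation: cos(word, centroid_A) - cos(word, centroid_B).
    """
    vec = embeddings[word]
    return (cosine(vec, centroid(target_a, embeddings))
            - cosine(vec, centroid(target_b, embeddings)))


def most_biased_words(target_a, target_b, embeddings, top_k=3):
    """Rank all non-target words by their bias towards target set A."""
    targets = set(target_a) | set(target_b)
    scored = [
        (w, bias_score(w, target_a, target_b, embeddings))
        for w in embeddings
        if w not in targets
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


if __name__ == "__main__":
    female, male = ["she", "her"], ["he", "his"]
    for word, score in most_biased_words(female, male, EMBEDDINGS):
        print(f"{word}: {score:+.3f}")
```

Because the ranking runs over the whole vocabulary, no list of candidate biased words has to be supplied in advance. Only the two target sets are fixed, which is the property the contribution statement emphasises.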
Findings
Successfully identified gender, religion, and ethnic biases in Reddit communities.
Validated the method by comparing the discovered biases with those found in the Google News dataset.
Demonstrated the approach's effectiveness in community-centric online discourse.
Abstract
We present a data-driven approach using word embeddings to discover and categorise language biases on the discussion platform Reddit. As spaces for isolated user communities, platforms such as Reddit are increasingly connected to issues of racism, sexism and other forms of discrimination. Hence, there is a need to monitor the language of these groups. One of the most promising AI approaches to trace linguistic biases in large textual datasets involves word embeddings, which transform text into high-dimensional dense vectors and capture semantic relations between words. Yet, previous studies require predefined sets of potential biases to study, e.g., whether gender is more or less associated with particular types of jobs. This makes these approaches unfit to deal with smaller and community-centric datasets such as those on Reddit, which contain smaller vocabularies and slang, as well as…
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Authorship Attribution and Profiling · Media Influence and Politics
