Sex, drugs, and violence

Stefania Raimondo; Frank Rudzicz

arXiv:1608.03448 · cs.CL · August 12, 2016 · 1 citation

Open Access

TL;DR

This paper presents a largely unsupervised NLP approach that uses topic modeling to detect inappropriate content in online narratives, reporting recall of up to 96% and low regression errors.

Contribution

It introduces a novel application of topic modeling to automatically assess content appropriateness with minimal supervision.

Findings

01 Recall up to 96% in detecting inappropriate content

02 Effective regression of appropriateness ratings using inferred topics

03 Potential for scalable moderation of online user-generated content
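
For readers less familiar with the two kinds of figures quoted above, the following is a minimal sketch of how recall and a regression error are typically computed. The labels and ratings are made up for illustration, and mean absolute error is an assumption here: this page does not say which regression error measure the paper reports.

```python
def recall(y_true, y_pred):
    """Fraction of truly inappropriate items that were flagged: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return tp / (tp + fn) if (tp + fn) else 0.0

def mean_absolute_error(r_true, r_pred):
    """Average absolute gap between annotated and predicted ratings."""
    return sum(abs(a - b) for a, b in zip(r_true, r_pred)) / len(r_true)

# Hypothetical data: 1 = inappropriate, 0 = appropriate.
flags_true = [1, 1, 1, 0, 0, 1]
flags_pred = [1, 1, 0, 0, 1, 1]
print(recall(flags_true, flags_pred))  # 0.75 (3 of 4 inappropriate items caught)

# Hypothetical appropriateness ratings on an arbitrary scale.
ratings_true = [1.0, 2.0, 4.0]
ratings_pred = [1.5, 2.0, 3.0]
print(mean_absolute_error(ratings_true, ratings_pred))  # 0.5
```

High recall means few inappropriate items slip through, which is usually the priority in moderation, even at the cost of some false alarms.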

Abstract

Automatically detecting inappropriate content can be a difficult NLP task, requiring understanding context and innuendo, not just identifying specific keywords. Due to the large quantity of online user-generated content, automatic detection is becoming increasingly necessary. We take a largely unsupervised approach using a large corpus of narratives from a community-based self-publishing website and a small segment of crowd-sourced annotations. We explore topic modelling using latent Dirichlet allocation (and a variation), and use these to regress appropriateness ratings, effectively automating rating for suitability. The results suggest that certain topics inferred may be useful in detecting latent inappropriateness -- yielding recall up to 96% and low regression errors.
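
The pipeline described in the abstract (infer topics without labels, then regress ratings from the topics using a small annotated subset) can be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: the corpus, vocabulary, ratings, hyperparameters, and the helper names `lda_gibbs` and `fit_line` are all assumptions, the sampler is plain collapsed Gibbs LDA rather than the variation the paper mentions, and the regression uses a single topic proportion as its only feature for brevity.

```python
import random

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA; returns per-document topic proportions."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # counts n_dk
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # counts n_kw
    topic_total = [0] * n_topics                              # counts n_k
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha) * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Smoothed topic proportions; each row sums to 1.
    return [
        [(doc_topic[d][k] + alpha) / (len(doc) + n_topics * alpha)
         for k in range(n_topics)]
        for d, doc in enumerate(docs)
    ]

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    return my - b * mx, b

# Toy corpus: each document is a list of word ids into `vocab` (hypothetical data).
vocab = ["garden", "tea", "walk", "fight", "blood", "knife"]
docs = [
    [0, 1, 2, 0, 1],
    [1, 2, 0, 2],
    [3, 4, 5, 3, 4],
    [4, 5, 3, 5],
    [0, 1, 3, 4, 2],
]
theta = lda_gibbs(docs, n_topics=2, vocab_size=len(vocab))

# Only the first four documents carry (made-up) crowd-sourced ratings.
labelled = [0, 1, 2, 3]
ratings = [1.0, 1.0, 4.0, 5.0]
a, b = fit_line([theta[d][0] for d in labelled], ratings)

# Predict a rating for every document, including the unlabelled one.
preds = [a + b * theta[d][0] for d in range(len(docs))]
for d, p in enumerate(preds):
    print(d, [round(t, 2) for t in theta[d]], round(p, 2))
```

The point of the design is that the expensive step (topic inference) needs no labels at all, so the annotated subset only has to be large enough to fit the much smaller regression model.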

Taxonomy

Topics: Topic Modeling · Authorship Attribution and Profiling · Spam and Phishing Detection