Wikipedia Vandalism Detection Through Machine Learning: Feature Review and New Proposals: Lab Report for PAN at CLEF 2010

Santiago M. Mola-Velasco

arXiv:1210.5560 · cs.IR · October 23, 2012 · 24 cites

Open Access

TL;DR

This paper reviews features for detecting Wikipedia vandalism using machine learning, extends previous frameworks, and reports that a Random Forest classifier achieved top performance in the PAN 2010 vandalism detection challenge.

Contribution

It extends prior vandalism detection frameworks with new features and reports that LogitBoost and Random Forest were the best-performing classifiers for this task.

Findings

01

Random Forest achieved an AUC of 0.92236

02

The approach ranked first in the PAN 2010 vandalism detection task

03

Supervised learning effectively detects vandalism in Wikipedia edits
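
The AUC figure above is the area under the ROC curve. As a rough illustration of what that number measures, the sketch below computes AUC as the probability that a randomly chosen vandalism edit receives a higher score than a randomly chosen regular edit. The function name `roc_auc` and the toy labels and scores are assumptions made for this example; they are not data or code from the paper.

```python
def roc_auc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for vandalism, 0 for regular edits (toy convention).
    scores: classifier scores, higher meaning "more likely vandalism".
    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 4))
```

This quadratic pairwise form is chosen for readability; on a corpus of realistic size a rank-based computation would be the more practical choice.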

Abstract

Wikipedia is an online encyclopedia that anyone can edit. In this open model, some people edit with the intent of harming the integrity of Wikipedia. This is known as vandalism. We extend the framework presented in (Potthast, Stein, and Gerling, 2008) for Wikipedia vandalism detection. In this approach, several vandalism-indicating features are extracted from edits in a vandalism corpus and fed to a supervised learning algorithm. The best-performing classifiers were LogitBoost and Random Forest. Our classifier, a Random Forest, obtained an AUC of 0.92236, ranking first in the PAN'10 Wikipedia vandalism detection task.
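
To make the feature-extraction step in the abstract more concrete, here is a minimal sketch of turning one edit (an old and a new revision) into a numeric feature vector. The four features, the function names, and the tiny word list are assumptions chosen for illustration; they are not the paper's actual feature set or lexicon. A vector like this would then be passed to a supervised learner such as a Random Forest.

```python
import re
from itertools import groupby

# Hypothetical, tiny word list for illustration only; not the paper's lexicon.
VULGAR_WORDS = {"stupid", "idiot", "sucks"}


def longest_char_run(text: str) -> int:
    """Length of the longest run of one repeated character ('aaaa' -> 4)."""
    return max((len(list(group)) for _, group in groupby(text)), default=0)


def extract_features(old_text: str, new_text: str) -> dict:
    """Compute a few illustrative edit-level features from two revisions."""
    letters = [c for c in new_text if c.isalpha()]
    upper = sum(1 for c in letters if c.isupper())
    words = re.findall(r"[a-z']+", new_text.lower())
    return {
        # Relative size of the new revision; +1 avoids division by zero.
        "size_ratio": (len(new_text) + 1) / (len(old_text) + 1),
        # Share of letters that are upper case.
        "upper_ratio": upper / len(letters) if letters else 0.0,
        # Long repeated-character runs such as '!!!!!!'.
        "longest_char_run": longest_char_run(new_text),
        # Number of words found in the illustrative word list.
        "vulgar_count": sum(1 for w in words if w in VULGAR_WORDS),
    }


# Invented example edit, not taken from any corpus.
features = extract_features(
    "Madrid is the capital of Spain.",
    "Madrid is STUPID!!!!!! lol",
)
print(features)
```

For simplicity the sketch computes features over the whole new revision; a real system might instead compute some of them only over the text inserted by the edit.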

Taxonomy

Topics: Wikis in Education and Collaboration · Natural Language Processing Techniques · Software Engineering Research