Fair multilingual vandalism detection system for Wikipedia
Mykola Trokhymovych, Muniza Aslam, Ai-Jou Chou, Ricardo Baeza-Yates, and Diego Saez-Trumper

TL;DR
This paper introduces a fair, multilingual vandalism detection system for Wikipedia that improves coverage and accuracy and reduces bias across 47 languages, supporting community moderation efforts.
Contribution
The paper presents a novel multilingual vandalism detection system using advanced filtering and masked language modeling, significantly expanding language coverage and outperforming existing tools.
Findings
Increased language coverage to 47 languages.
Outperforms ORES, the vandalism detection system currently used in production on Wikipedia.
Reduces bias against contributor groups.
Abstract
This paper presents a novel design of a system aimed at supporting the Wikipedia community in addressing vandalism on the platform. To achieve this, we collected a massive dataset covering 47 languages and applied advanced filtering and feature engineering techniques, including multilingual masked language modeling, to build the training dataset from human-generated data. The performance of the system was evaluated through comparison with the one used in production on Wikipedia, known as ORES. Our research results in a significant increase in the number of languages covered, making Wikipedia patrolling more efficient for a wider range of communities. Furthermore, our model outperforms ORES, ensuring that the results provided are not only more accurate but also less biased against certain groups of contributors.
Taxonomy
Topics: Wikis in Education and Collaboration · Cancer-related gene regulation · Protein Degradation and Inhibitors
