The Rise of AI-Generated Content in Wikipedia
Creston Brooks, Samuel Eggert, Denis Peskoff

TL;DR
This paper investigates the growing presence of AI-generated content in Wikipedia. AI detectors calibrated to a 1% false positive rate flag over 5% of newly created English articles as AI-generated. Flagged articles tend to be lower quality and biased, raising concerns about information integrity.
Contribution
It introduces a methodology that uses two AI detectors, GPTZero and Binoculars, to establish lower bounds on AI-generated content in Wikipedia, and it demonstrates a marked increase in such content in pages created after GPT-3.5's release.
Findings
Detectors flag over 5% of newly created English Wikipedia articles as AI-generated, with lower percentages for German, French, and Italian.
Flagged articles are generally of lower quality.
Bias and self-promotion are common in flagged articles.
Abstract
The rise of AI-generated content in popular information sources raises significant concerns about accountability, accuracy, and bias amplification. Beyond directly impacting consumers, the widespread presence of this content poses questions for the long-term viability of training language models on vast internet sweeps. We use GPTZero, a proprietary AI detector, and Binoculars, an open-source alternative, to establish lower bounds on the presence of AI-generated content in recently created Wikipedia pages. Both detectors reveal a marked increase in AI-generated content in recent pages compared to those from before the release of GPT-3.5. With thresholds calibrated to achieve a 1% false positive rate on pre-GPT-3.5 articles, detectors flag over 5% of newly created English Wikipedia articles as AI-generated, with lower percentages for German, French, and Italian articles. Flagged…
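The threshold calibration described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `calibrate_threshold` and `flagged_fraction` are hypothetical, and it assumes each article has already been given a numeric detector score where a higher score means "more likely AI-generated" (for a detector whose scale runs the other way, negate the scores first).

```python
import math


def calibrate_threshold(pre_scores, target_fpr=0.01):
    """Pick a threshold so that at most `target_fpr` of the pre-GPT-3.5
    scores lie strictly above it (hypothetical helper, not from the paper).

    Assumes higher score = more likely AI-generated.
    """
    if not pre_scores:
        raise ValueError("need at least one pre-GPT-3.5 score")
    ordered = sorted(pre_scores)
    n = len(ordered)
    # Number of pre-GPT-3.5 articles we tolerate being flagged.
    # The small epsilon guards against float rounding in target_fpr * n.
    allowed = math.floor(target_fpr * n + 1e-9)
    # Everything strictly above this score is flagged.
    return ordered[n - allowed - 1]


def flagged_fraction(scores, threshold):
    """Fraction of articles whose score is strictly above the threshold."""
    return sum(s > threshold for s in scores) / len(scores)


if __name__ == "__main__":
    # Synthetic placeholder scores, not real detector output.
    pre_scores = [i / 100 for i in range(100)]
    new_scores = [0.5, 0.99, 0.995, 0.2]
    threshold = calibrate_threshold(pre_scores, target_fpr=0.01)
    print(threshold, flagged_fraction(new_scores, threshold))
```

Because the threshold is fixed on articles that predate GPT-3.5, any excess in the flagged fraction of newer articles over the 1% baseline is what the abstract treats as evidence of AI-generated content.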
Taxonomy
Topics: Wikis in Education and Collaboration · Natural Language Processing Techniques · Topic Modeling
Methods: Residual Connection · Attention Is All You Need · Linear Layer · Weight Decay · Cosine Annealing · Dropout · Byte Pair Encoding · Softmax
