Generative Models are Unsupervised Predictors of Page Quality: A Colossal-Scale Study
Dara Bahri, Yi Tay, Che Zheng, Donald Metzler, Cliff Brunk, Andrew Tomkins

TL;DR
This study shows that classifiers trained to distinguish human from machine-generated text can serve as unsupervised indicators of webpage quality, enabling scalable quality assessment without labeled data.
Contribution
It introduces a novel use of generative models as unsupervised predictors of page quality and provides the largest-scale analysis of low-quality web content to date.
Findings
Classifiers can predict page quality without supervision.
Large-scale analysis reveals prevalence of low-quality pages.
Generative models assist in scalable content quality assessment.
Abstract
Large generative language models such as GPT-2 are well known for their ability to generate text as well as their utility in supervised downstream tasks via fine-tuning. Our work is twofold: first, we demonstrate via human evaluation that classifiers trained to discriminate between human and machine-generated text emerge as unsupervised predictors of "page quality", able to detect low-quality content without any training. This enables fast bootstrapping of quality indicators in a low-resource setting. Second, curious to understand the prevalence and nature of low-quality pages in the wild, we conduct extensive qualitative and quantitative analysis over 500 million web articles, making this the largest-scale study ever conducted on the topic.
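The core idea in the abstract, reusing a human-vs-machine detector's output as a low-quality score, can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `rank_by_quality` and `toy_detector` are hypothetical names, and `toy_detector` is a crude word-repetition heuristic standing in for a trained classifier that returns P(machine-written).

```python
from typing import Callable, Iterable, List, Tuple


def rank_by_quality(
    pages: Iterable[str],
    p_machine: Callable[[str], float],
) -> List[Tuple[float, str]]:
    """Sort pages from most to least suspect.

    The detector's P(machine-written) is reused directly as a
    low-quality score; no quality labels are involved.
    """
    scored = [(p_machine(text), text) for text in pages]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def toy_detector(text: str) -> float:
    """Hypothetical stand-in for a trained detector (assumption).

    Returns the fraction of repeated words, a number in [0, 1).
    A real system would call a classifier trained to separate
    human-written from machine-generated text.
    """
    words = text.lower().split()
    if not words:
        return 0.0
    return 1.0 - len(set(words)) / len(words)


pages = [
    "buy cheap pills buy cheap pills buy cheap pills",
    "The committee published its findings after a two-year review.",
]
ranked = rank_by_quality(pages, toy_detector)
for score, text in ranked:
    print(f"{score:.2f}  {text}")
```

Any callable that maps text to a probability can be passed in place of `toy_detector`, so the ranking step stays the same when a real detector is substituted.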
Taxonomy
Methods: Linear Layer · Cosine Annealing · Dense Connections · Linear Warmup With Cosine Annealing · Layer Normalization · Attention Dropout · Attention Is All You Need · Byte Pair Encoding · Adam · Weight Decay
