Measuring the Prevalence of Policy Violating Content with ML Assisted Sampling and LLM Labeling
Attila Dobi, Aravindh Manickavasagam, Benjamin Thompson, Xiaohan Yang, Faisal Farooq

TL;DR
This paper introduces a scalable, ML-assisted sampling and LLM-based labeling system for measuring the prevalence of policy-violating content in user impressions, producing daily prevalence estimates with confidence intervals that can be broken down by content segment.
Contribution
It presents a novel design-based measurement system combining ML sampling, LLM labeling, and statistical estimation for efficient, unbiased prevalence measurement across multiple content segments.
Findings
Achieves unbiased prevalence estimates with confidence intervals.
Supports multi-dimensional analysis from a single global sample.
Concentrates the labeling budget on high-exposure and high-risk content, improving measurement efficiency when violations are rare.
Abstract
Content safety teams need metrics that reflect what users actually experience, not only what is reported. We study prevalence: the fraction of user views (impressions) that went to content violating a given policy on a given day. Accurate prevalence measurement is challenging because violations are often rare and human labeling is costly, making frequent, platform-representative studies slow. We present a design-based measurement system that (i) draws daily probability samples from the impression stream using ML-assisted weights to concentrate label budget on high-exposure and high-risk content while preserving unbiasedness, (ii) labels sampled items with a multimodal LLM governed by policy prompts and gold-set validation, and (iii) produces design-consistent prevalence estimates with confidence intervals and dashboard drilldowns. A key design goal is one global sample with many pivots:…
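The abstract's central idea, sampling impressions with unequal, ML-informed probabilities while keeping the prevalence estimate unbiased, can be illustrated with a minimal sketch. The excerpt does not give the paper's actual weighting scheme or estimator, so everything here is an assumption: the sketch uses the textbook Hansen-Hurwitz estimator for probability-proportional-to-size sampling with replacement, and the names `draw_pps_sample`, `hansen_hurwitz_prevalence`, and the toy risk scores are hypothetical.

```python
import math
import random


def draw_pps_sample(weights, n, rng):
    """Draw n impression indices with replacement, with per-draw
    probability proportional to `weights` (all weights must be > 0)."""
    total = sum(weights)
    probs = [w / total for w in weights]
    idx = rng.choices(range(len(weights)), weights=probs, k=n)
    return idx, probs


def hansen_hurwitz_prevalence(labels, draw_probs, population_size):
    """Estimate the fraction of violating impressions.

    labels:          0/1 label for each sampled impression
    draw_probs:      per-draw selection probability of each sampled impression
    population_size: total number of impressions in the stream (N)

    Each term y_i / (N * p_i) reweights a label by the inverse of its
    selection probability, which is what keeps the estimate unbiased
    even though risky items are oversampled.
    """
    n = len(labels)
    terms = [y / (population_size * p) for y, p in zip(labels, draw_probs)]
    est = sum(terms) / n
    # Standard variance estimate for sampling with replacement (needs n >= 2).
    var = sum((t - est) ** 2 for t in terms) / (n * (n - 1))
    half = 1.96 * math.sqrt(var)  # normal-approximation 95% interval
    return est, (max(0.0, est - half), min(1.0, est + half))


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic toy population, not data from the paper: 1 = violating impression.
    truth = [1 if rng.random() < 0.02 else 0 for _ in range(10_000)]
    # Hypothetical risk scores standing in for an ML classifier's output.
    scores = [0.9 if y else 0.1 for y in truth]
    idx, probs = draw_pps_sample(scores, 500, rng)
    # In the described system the labels would come from the LLM labeler;
    # here the synthetic truth stands in for them.
    est, ci = hansen_hurwitz_prevalence(
        [truth[i] for i in idx], [probs[i] for i in idx], len(truth)
    )
    print(est, ci)
```

Because each sampled label is divided by its own selection probability, the same sample can in principle be reused for subgroup estimates by zeroing out labels outside the subgroup, which is one plausible reading of the "one global sample with many pivots" goal.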
Taxonomy
Topics: Data Visualization and Analytics · Safety Warnings and Signage · Hate Speech and Cyberbullying Detection
