Measuring the Prevalence of Policy Violating Content with ML Assisted Sampling and LLM Labeling

Attila Dobi; Aravindh Manickavasagam; Benjamin Thompson; Xiaohan Yang; Faisal Farooq

arXiv:2602.18518·cs.LG·February 24, 2026

TL;DR

This paper introduces a scalable system that combines ML-assisted sampling with LLM-based labeling to measure the daily prevalence of policy-violating content in user impressions, with dashboard drilldowns for content safety analysis.

Contribution

The paper presents a design-based measurement system that combines ML-assisted sampling, multimodal LLM labeling, and statistical estimation to measure prevalence efficiently and without bias across multiple content segments.

Findings

1. Achieves unbiased prevalence estimates with confidence intervals.

2. Supports multi-dimensional analysis from a single global sample.

3. Improves efficiency in detecting rare policy violations.

Abstract

Content safety teams need metrics that reflect what users actually experience, not only what is reported. We study prevalence: the fraction of user views (impressions) that went to content violating a given policy on a given day. Accurate prevalence measurement is challenging because violations are often rare and human labeling is costly, making frequent, platform-representative studies slow. We present a design-based measurement system that (i) draws daily probability samples from the impression stream using ML-assisted weights to concentrate label budget on high-exposure and high-risk content while preserving unbiasedness, (ii) labels sampled items with a multimodal LLM governed by policy prompts and gold-set validation, and (iii) produces design-consistent prevalence estimates with confidence intervals and dashboard drilldowns. A key design goal is one global sample with many pivots:…
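
The idea of weighting the sample toward risky content while keeping the estimate unbiased can be illustrated with a small sketch. The excerpt above does not state which sampling scheme or estimator the authors use, so everything here is an assumption for illustration: draws are made with replacement with probability proportional to a floored ML risk score, prevalence is estimated with the classical Hansen-Hurwitz estimator, and a segment pivot uses a ratio estimator over the same sample. The function names, the `floor` parameter, and the synthetic data are hypothetical.

```python
import math
import random


def sampling_probs(risk_scores, floor=0.05):
    """Turn ML risk scores into per-draw probabilities that sum to 1.

    The floor (an assumed tuning knob) keeps every impression's probability
    strictly positive, which the estimator below needs to stay unbiased.
    """
    weights = [floor + s for s in risk_scores]
    total = sum(weights)
    return [w / total for w in weights]


def estimate_prevalence(labels, probs, population_size, z=1.96):
    """Hansen-Hurwitz estimate from a with-replacement sample.

    labels[k] is 1 if the k-th sampled impression violates policy, else 0;
    probs[k] is the per-draw probability with which it was selected.
    Returns (prevalence, ci_low, ci_high) using a normal approximation.
    """
    n = len(labels)
    # Each draw yields an unbiased one-draw estimate of the violating fraction.
    per_draw = [y / (p * population_size) for y, p in zip(labels, probs)]
    estimate = sum(per_draw) / n
    variance = sum((v - estimate) ** 2 for v in per_draw) / (n * (n - 1))
    half_width = z * math.sqrt(variance)
    return estimate, max(0.0, estimate - half_width), min(1.0, estimate + half_width)


def estimate_segment_prevalence(labels, probs, in_segment):
    """Ratio estimate for one pivot (segment) of the same global sample.

    Divides the weighted violating count by the weighted segment size;
    ratio estimators are only approximately unbiased in finite samples.
    """
    num = sum(y / p for y, p, d in zip(labels, probs, in_segment) if d)
    den = sum(1 / p for p, d in zip(probs, in_segment) if d)
    return num / den if den else float("nan")


if __name__ == "__main__":
    rng = random.Random(0)
    N = 50_000
    # Synthetic ground truth: roughly 1% of impressions violate policy.
    truth = [1 if rng.random() < 0.01 else 0 for _ in range(N)]
    # Synthetic classifier: violating impressions tend to score higher.
    scores = [min(1.0, max(0.0, rng.gauss(0.7 if y else 0.1, 0.15))) for y in truth]
    probs = sampling_probs(scores)
    drawn = rng.choices(range(N), weights=probs, k=2_000)
    est, lo, hi = estimate_prevalence(
        [truth[i] for i in drawn], [probs[i] for i in drawn], N
    )
    print(f"true={sum(truth) / N:.4f} estimate={est:.4f} CI=({lo:.4f}, {hi:.4f})")
```

Dividing each sampled label by its selection probability is what cancels the oversampling of high-risk impressions: in expectation every impression contributes exactly its true label, so the label budget can be skewed toward likely violations without skewing the estimate. In a real pipeline the labels would come from the LLM labeler rather than from known ground truth.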

Taxonomy

Topics: Data Visualization and Analytics · Safety Warnings and Signage · Hate Speech and Cyberbullying Detection