Calpric: Inclusive and Fine-grain Labeling of Privacy Policies with Crowdsourcing and Active Learning
Wenjun Qiu, David Lie, and Lisa Austin

TL;DR
Calpric introduces a cost-effective method combining automated segmentation, crowdsourcing, and active learning to generate detailed, balanced privacy policy datasets, enabling more accurate deep learning models.
Contribution
Calpric presents an integrated approach that uses crowdsourcing and active learning to reduce labeling costs and improve data quality for privacy policy analysis.
Findings
Generated 16K labeled privacy policy segments across 9 categories
Achieved labeling costs of roughly $0.92-$1.71 per segment
Produced models with improved accuracy and fine-grain labeling
Abstract
A significant challenge to training accurate deep learning models on privacy policies is the cost and difficulty of obtaining a large and comprehensive set of training data. To address these challenges, we present Calpric, which combines automatic text selection and segmentation, active learning and the use of crowdsourced annotators to generate a large, balanced training set for privacy policies at low cost. Automated text selection and segmentation simplifies the labeling task, enabling untrained annotators from crowdsourcing platforms, like Amazon's Mechanical Turk, to be competitive with trained annotators, such as law students, and also reduces inter-annotator disagreement, which decreases labeling cost. Having reliable labels for training enables the use of active learning, which uses fewer training samples to efficiently cover the input space, further reducing cost and improving…
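The active learning step mentioned in the abstract can be illustrated with a minimal sketch of uncertainty sampling, a common query strategy. This is an assumption made for illustration: the abstract does not say which selection strategy Calpric uses. The segment ids, the probabilities, and the function names below are all hypothetical.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a binary prediction with positive-class probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def select_for_labeling(probs: dict[str, float], budget: int) -> list[str]:
    """Return the ids of the `budget` segments with the most uncertain predictions."""
    ranked = sorted(
        probs,
        key=lambda seg_id: binary_entropy(probs[seg_id]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical model outputs: probability that each unlabeled segment
# mentions a given data category (not taken from the paper).
pool = {"seg-a": 0.97, "seg-b": 0.52, "seg-c": 0.08, "seg-d": 0.61}

# Only the segments the model is least sure about are sent to annotators.
to_label = select_for_labeling(pool, budget=2)
print(to_label)
```

In a loop of this kind, the selected segments would be labeled, added to the training set, and the model retrained before the next selection round, so the labeling budget is spent where the model is least certain.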
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Privacy-Preserving Technologies in Data · Hate Speech and Cyberbullying Detection
Methods: Sparse Evolutionary Training
