Calpric: Inclusive and Fine-grain Labeling of Privacy Policies with Crowdsourcing and Active Learning

Wenjun Qiu, David Lie, and Lisa Austin

arXiv:2401.08038 · cs.CL · January 17, 2024 · 1 citation

TL;DR

Calpric introduces a cost-effective method combining automated segmentation, crowdsourcing, and active learning to generate detailed, balanced privacy policy datasets, enabling more accurate deep learning models.

Contribution

The paper presents a novel integrated approach that uses crowdsourcing and active learning to reduce labeling costs and improve data quality for privacy policy analysis.

Findings

01. Generated 16K labeled privacy policy segments across 9 categories

02. Achieved labeling costs of roughly $0.92-$1.71 per segment

03. Produced models with improved accuracy and fine-grain labeling

Abstract

A significant challenge to training accurate deep learning models on privacy policies is the cost and difficulty of obtaining a large and comprehensive set of training data. To address these challenges, we present Calpric, which combines automatic text selection and segmentation, active learning and the use of crowdsourced annotators to generate a large, balanced training set for privacy policies at low cost. Automated text selection and segmentation simplifies the labeling task, enabling untrained annotators from crowdsourcing platforms, like Amazon's Mechanical Turk, to be competitive with trained annotators, such as law students, and also reduces inter-annotator agreement, which decreases labeling cost. Having reliable labels for training enables the use of active learning, which uses fewer training samples to efficiently cover the input space, further reducing cost and improving…
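The abstract combines two ideas that a short sketch can make concrete: active learning picks which segments are worth paying annotators for, and several crowd votes are combined into one label. The Python below is only an illustration under stated assumptions, not the authors' implementation. It uses uncertainty sampling (a common active-learning strategy, swapped in here because the abstract does not name one) and simple majority voting. The function names, the `toy_scores` values, and the example segments are all hypothetical.

```python
from collections import Counter


def uncertainty(prob):
    """Uncertainty of a binary probability: highest at 0.5, lowest at 0 or 1."""
    return 1.0 - abs(prob - 0.5) * 2.0


def select_for_labeling(segments, predict_proba, batch_size):
    """Return the batch_size segments the current model is least sure about."""
    ranked = sorted(
        segments,
        key=lambda seg: uncertainty(predict_proba(seg)),
        reverse=True,
    )
    return ranked[:batch_size]


def majority_label(votes):
    """Combine crowd votes; return None when no label has a strict majority."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count * 2 > len(votes) else None


# Hypothetical model scores: probability that a segment belongs to one category.
toy_scores = {
    "We collect your email address.": 0.95,
    "Contact us for questions.": 0.05,
    "We may use information as described.": 0.55,
}

# The segment scored closest to 0.5 is the one sent to annotators.
to_label = select_for_labeling(list(toy_scores), toy_scores.get, batch_size=1)

# Three hypothetical crowd votes for that segment are merged into one label.
crowd_label = majority_label(["yes", "yes", "no"])
```

In a full loop, the newly labeled segments would be added to the training set and the model retrained before the next selection round. Segments without a strict majority could be sent back for more votes.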

Code & Models

Repositories

dlgroupuoft/calpric (tf, Official)

Taxonomy

Topics: Mobile Crowdsensing and Crowdsourcing · Privacy-Preserving Technologies in Data · Hate Speech and Cyberbullying Detection

Methods: Sparse Evolutionary Training