Deep Active Learning with Crowdsourcing Data for Privacy Policy Classification

Wenjun Qiu; David Lie

arXiv:2008.02954 · cs.CR · November 4, 2025 · 1 citation

Open Access

TL;DR

This paper introduces Calpric, an active learning and crowdsourcing-based tool that efficiently classifies privacy policies with high accuracy while significantly reducing labeling effort and addressing class imbalance.

Contribution

It presents a novel combination of active learning and crowdsourcing for privacy policy classification, reducing labeling costs and improving class balance.

Findings

01 Achieves the same F1 score with only 62% of the original labels

02 Effectively addresses class imbalance in privacy policy data

03 Produces annotations comparable to those of skilled human annotators while reducing labeling cost

Abstract

Privacy policies are statements that notify users of the services' data practices. However, few users are willing to read through policy texts due to their length and complexity. While automated tools based on machine learning exist for privacy policy analysis, classifiers need to be trained on a large labeled dataset to achieve high classification accuracy. Most existing policy corpora are labeled by skilled human annotators, requiring a significant amount of labor hours and effort. In this paper, we leverage active learning and crowdsourcing techniques to develop an automated classification tool named Calpric (Crowdsourcing Active Learning PRIvacy Policy Classifier), which is able to perform annotations equivalent to those done by skilled human annotators with high accuracy while minimizing the labeling cost. Specifically, active learning allows classifiers to proactively select the most…
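The two ingredients named in the abstract, crowdsourced labels and active selection of what to label, can be illustrated with a minimal sketch. It is not Calpric's implementation: the abstract is truncated before it states the selection criterion, so the least-confidence rule, the majority-vote aggregation, the function names, and the example scores below are all assumptions chosen for illustration.

```python
from collections import Counter


def majority_vote(crowd_labels):
    """Aggregate the crowd labels for one policy segment.

    Returns the most common label, or None when there are no labels
    or the top two labels are tied (assumed aggregation rule).
    """
    counts = Counter(crowd_labels).most_common(2)
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def least_confident(probabilities, k):
    """Return the indices of the k segments the classifier is least sure about.

    Each probability is the predicted chance that a segment belongs to the
    positive class; values nearest 0.5 are treated as most informative
    (assumed selection rule, not confirmed by the source).
    """
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:k]


# Hypothetical classifier scores for five unlabeled policy segments.
scores = [0.97, 0.52, 0.10, 0.45, 0.80]
query = least_confident(scores, k=2)

# Hypothetical crowd answers for the first queried segment.
label = majority_vote(["yes", "yes", "no"])
```

In a loop of this kind, only the queried segments are sent to crowd workers, their aggregated labels are added to the training set, and the classifier is retrained before the next round of selection.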

Taxonomy

Topics: Mobile Crowdsensing and Crowdsourcing · Privacy-Preserving Technologies in Data · Machine Learning and Algorithms