Deep Active Learning with Crowdsourcing Data for Privacy Policy Classification
Wenjun Qiu, David Lie

TL;DR
This paper introduces Calpric, an active learning and crowdsourcing-based tool that efficiently classifies privacy policies with high accuracy while significantly reducing labeling effort and addressing class imbalance.
Contribution
It presents a novel combination of active learning and crowdsourcing for privacy policy classification, reducing labeling costs and improving class balance.
Findings
Achieves the same F1 score using only 62% of the original labels
Effectively addresses class imbalance in privacy policy data
Produces annotations comparable to those of skilled human annotators while reducing labeling effort
Abstract
Privacy policies are statements that notify users of the services' data practices. However, few users are willing to read through policy texts due to their length and complexity. While automated tools based on machine learning exist for privacy policy analysis, classifiers need to be trained on a large labeled dataset to achieve high classification accuracy. Most existing policy corpora are labeled by skilled human annotators, requiring a significant amount of labor hours and effort. In this paper, we leverage active learning and crowdsourcing techniques to develop an automated classification tool named Calpric (Crowdsourcing Active Learning PRIvacy Policy Classifier), which is able to perform annotations equivalent to those done by skilled human annotators with high accuracy while minimizing the labeling cost. Specifically, active learning allows classifiers to proactively select the most…
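The abstract is truncated before it states how Calpric selects samples, so the following is only a generic sketch of pool-based active learning with crowdsourced labels, not the authors' method. It assumes least-confidence sampling as the query strategy and simple majority voting to aggregate crowd labels; the function names, the tie-breaking rule, and the toy probabilities are all hypothetical.

```python
from typing import Dict, List, Sequence


def least_confident(probabilities: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return indices of the k pool items whose top class probability is lowest.

    Assumption: least-confidence sampling stands in for whatever query
    strategy the paper actually uses.
    """
    confidence = [max(p) for p in probabilities]
    ranked = sorted(range(len(confidence)), key=lambda i: confidence[i])
    return ranked[:k]


def majority_vote(labels: Sequence[int]) -> int:
    """Aggregate binary crowd labels for one item.

    Assumption: ties are resolved in favor of the positive class (1).
    """
    positives = sum(labels)
    return 1 if 2 * positives >= len(labels) else 0


# Toy classifier outputs for four unlabeled policy segments (made-up values).
pool_probs = [[0.95, 0.05], [0.55, 0.45], [0.10, 0.90], [0.48, 0.52]]

# Ask the crowd only about the two segments the classifier is least sure of.
queried = least_confident(pool_probs, k=2)

# Hypothetical crowd responses, keyed by pool index.
crowd_labels: Dict[int, List[int]] = {3: [1, 0, 1], 1: [0, 0, 1]}
aggregated = {i: majority_vote(crowd_labels[i]) for i in queried}

print(queried)
print(aggregated)
```

In a full loop, the aggregated labels would be added to the training set, the classifier retrained, and the selection step repeated until the labeling budget is spent.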
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Privacy-Preserving Technologies in Data · Machine Learning and Algorithms
