TL;DR
This paper presents a large-scale, longitudinal dataset of over a million privacy policies from the Internet Archive, revealing trends in transparency, readability, and regulation over two decades.
Contribution
The authors created and validated a comprehensive dataset of privacy policies spanning 20 years, enabling new longitudinal analyses of privacy practices and regulation impacts.
Findings
Privacy policies have become less transparent regarding tracking technologies.
Privacy policies have doubled in length and become more difficult to read.
The GDPR has significantly influenced privacy policy content and structure.
Abstract
Automated analysis of privacy policies has proved a fruitful research direction, with developments such as automated policy summarization, question answering systems, and compliance detection. Prior research has been limited to analysis of privacy policies from a single point in time or from short spans of time, as researchers did not have access to a large-scale, longitudinal, curated dataset. To address this gap, we developed a crawler that discovers, downloads, and extracts archived privacy policies from the Internet Archive's Wayback Machine. Using the crawler and following a series of validation and quality control steps, we curated a dataset of 1,071,488 English language privacy policies, spanning over two decades and over 130,000 distinct websites. Our analyses of the data paint a troubling picture of the transparency and accessibility of privacy policies. By comparing the…
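The discovery step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' crawler: it assumes the Wayback Machine's public CDX search endpoint is used to list captures of a known policy URL, and the helper names `build_cdx_query` and `snapshot_urls` are hypothetical. The sketch only builds the query and turns a CDX JSON response into snapshot URLs; downloading and text extraction are left out.

```python
from urllib.parse import urlencode

# Public CDX search endpoint of the Wayback Machine (assumed here;
# the paper's crawler may discover captures differently).
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(page_url, start_year, end_year):
    """Build a CDX query listing successful captures of page_url."""
    params = {
        "url": page_url,
        "output": "json",
        "from": str(start_year),
        "to": str(end_year),
        "filter": "statuscode:200",   # keep only HTTP 200 captures
        "collapse": "timestamp:4",    # at most one capture per year
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def snapshot_urls(cdx_rows):
    """Convert a CDX JSON response (header row first) to snapshot URLs."""
    if not cdx_rows:
        return []
    header, *rows = cdx_rows
    ts = header.index("timestamp")
    orig = header.index("original")
    return [f"https://web.archive.org/web/{row[ts]}/{row[orig]}" for row in rows]
```

Collapsing on the first four timestamp digits is an illustrative choice that yields roughly one snapshot per year; a longitudinal study could pick a finer interval.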
