Extractive text summarisation of Privacy Policy documents using machine learning approaches
Chanwoo Choi

TL;DR
This paper presents two clustering-based models for summarizing Privacy Policy documents, with the PDC approach outperforming K-means in extracting GDPR-relevant sentences.
Contribution
Introduces a novel PDC clustering-based summarization method tailored for GDPR compliance in privacy policies, outperforming traditional K-means clustering.
Findings
PDC model outperforms K-means in ROUGE and SSD metrics
PDC effectively segregates sentences based on GDPR topics
Task-specific fine-tuning improves unsupervised summarization results
Abstract
This work demonstrates two Privacy Policy (PP) summarisation models based on two different clustering algorithms: K-means clustering and Pre-determined Centroid (PDC) clustering. K-means was selected for the first model after an extensive evaluation of ten commonly used clustering algorithms. The summariser model based on the PDC clustering algorithm summarises PP documents by assigning each sentence to a cluster according to its Euclidean distance from the pre-defined cluster centres. The cluster centres are defined according to the 14 essential topics that the General Data Protection Regulation (GDPR) requires every privacy notice to include. The PDC model outperformed the K-means model on two evaluation measures, Sum of Squared Distance (SSD) and ROUGE, by 27% and 24% respectively. This result contrasts with the K-means model's better performance in the general clustering of…
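The assignment step described in the abstract can be illustrated with a minimal sketch. It assumes sentences have already been embedded as fixed-length vectors and that one centroid vector exists per GDPR topic; how the paper builds either is not stated in the abstract, so the toy 2-D vectors, the function names `assign_to_centroids` and `pick_representatives`, and the "closest sentence per cluster" selection rule are all hypothetical. Only the nearest-centre assignment by Euclidean distance and the SSD measure come from the text above.

```python
import numpy as np

def assign_to_centroids(sentence_vecs, centroids):
    """Assign each sentence to its nearest fixed centroid; return (labels, ssd)."""
    # Pairwise squared Euclidean distances, shape (n_sentences, n_centroids).
    diffs = sentence_vecs[:, None, :] - centroids[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    labels = np.argmin(sq_dists, axis=1)
    # Sum of Squared Distance: squared distance of each sentence to its own centre.
    ssd = float(np.sum(sq_dists[np.arange(len(labels)), labels]))
    return labels, ssd

def pick_representatives(sentence_vecs, centroids, labels):
    """Hypothetical selection rule: per non-empty cluster, the sentence nearest its centre."""
    reps = {}
    for k in range(len(centroids)):
        members = np.flatnonzero(labels == k)
        if members.size == 0:
            continue  # no sentence in the document covers this topic
        dists = np.linalg.norm(sentence_vecs[members] - centroids[k], axis=1)
        reps[k] = int(members[np.argmin(dists)])
    return reps

# Toy data: 2 centroids instead of the paper's 14, and 2-D vectors for readability.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
sentence_vecs = np.array([[1.0, 0.0], [0.0, 2.0], [9.0, 10.0], [10.0, 12.0]])

labels, ssd = assign_to_centroids(sentence_vecs, centroids)
reps = pick_representatives(sentence_vecs, centroids, labels)
print(labels, ssd, reps)
```

Unlike K-means, the centroids here are never updated, so a single pass over the sentences is enough, and a cluster may legitimately stay empty when a policy omits a topic.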
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Privacy, Security, and Data Protection · Access Control and Trust
Methods: k-Means Clustering · Pre-determined Centroid (PDC) Clustering
