Towards Contextual Sensitive Data Detection
Liang Telkamp, Madelon Hulsebos

TL;DR
This paper presents a contextual framework for sensitive data detection that considers data type and domain context, significantly improving detection accuracy and reducing false positives, with practical applications demonstrated through experiments and case studies.
Contribution
The paper introduces a contextual data sensitivity framework for sensitive data detection that combines type contextualization and domain contextualization, and validates it through experiments with language models and real-world case studies.
Findings
Type-contextualization achieves 94% recall, compared with 63% for commercial tools.
Domain-contextualization enhances detection in non-standard data domains.
Context-grounded explanations aid manual data auditing.
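To make the recall and false-positive terms in the findings concrete, the following minimal sketch computes both for a detector's output over a set of columns. The column names and labels are hypothetical toy data, not results from the paper, and the helper names `recall` and `false_positives` are introduced here for illustration only.

```python
def recall(predicted: set, actual: set) -> float:
    """Fraction of truly sensitive columns that the detector flagged."""
    true_positives = len(predicted & actual)
    false_negatives = len(actual - predicted)
    total = true_positives + false_negatives
    return true_positives / total if total else 0.0


def false_positives(predicted: set, actual: set) -> set:
    """Columns flagged as sensitive that are not actually sensitive."""
    return predicted - actual


# Hypothetical ground truth and detector output for one dataset.
actual_sensitive = {"email", "ssn", "phone", "address"}
flagged = {"email", "ssn", "phone", "company_name"}

print(recall(flagged, actual_sensitive))           # 0.75
print(false_positives(flagged, actual_sensitive))  # {'company_name'}
```

In this toy case the detector misses one sensitive column (`address`) and wrongly flags one non-sensitive column (`company_name`).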
Abstract
The emergence of open data portals necessitates more attention to protecting sensitive data before datasets get published and exchanged. To do so effectively, we observe the need to refine and broaden our definitions of sensitive data, and argue that the sensitivity of data depends on its context. Following this definition, we introduce a contextual data sensitivity framework building on two core concepts: 1) type contextualization, which considers the type of the data values at hand within the overall context of the dataset or document to assess their true sensitivity, and 2) domain contextualization, which assesses the sensitivity of data values informed by domain-specific information external to the dataset, such as geographic origin of a dataset. Experiments instrumented with language models confirm that: 1) type-contextualization significantly reduces the number of false positives…
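The two concepts in the abstract can be illustrated with a small sketch. This is not the authors' method: the paper instruments its experiments with language models, whereas the code below substitutes simple keyword and lookup rules so that it runs on its own. All names (`Column`, `Dataset`, `PERSONAL_TYPES`, `PUBLIC_CUES`, `DOMAIN_RULES`) and all rule contents are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Column:
    name: str
    detected_type: str  # hypothetical type label, e.g. "email" or "location"


@dataclass
class Dataset:
    description: str
    columns: List[Column] = field(default_factory=list)
    origin: Optional[str] = None  # hypothetical domain tag, e.g. a region


# Assumption: types that may refer to private individuals.
PERSONAL_TYPES = {"person_name", "email", "phone"}
# Assumption: dataset-level cues that the values describe public entities.
PUBLIC_CUES = ("company", "business", "public office")
# Assumption: domain-specific rules external to the dataset, keyed by origin.
DOMAIN_RULES = {"conflict_zone": {"location"}}


def type_contextual(column: Column, dataset: Dataset) -> bool:
    """Flag a personal-looking type only if the dataset context
    does not indicate that the values belong to public entities."""
    if column.detected_type not in PERSONAL_TYPES:
        return False
    description = dataset.description.lower()
    return not any(cue in description for cue in PUBLIC_CUES)


def domain_contextual(column: Column, dataset: Dataset) -> bool:
    """Flag a type that is sensitive under rules tied to the dataset's origin."""
    return column.detected_type in DOMAIN_RULES.get(dataset.origin, set())


def is_sensitive(column: Column, dataset: Dataset) -> bool:
    return type_contextual(column, dataset) or domain_contextual(column, dataset)


email = Column("contact", "email")
location = Column("site", "location")

registry = Dataset("Registry of company contact emails", [email, location])
survey = Dataset("Household survey responses", [email, location],
                 origin="conflict_zone")

print([is_sensitive(c, registry) for c in registry.columns])  # [False, False]
print([is_sensitive(c, survey) for c in survey.columns])      # [True, True]
```

The same `email` type is flagged in the household survey but not in the company registry, which is the kind of false-positive reduction type contextualization aims for. The `location` column becomes sensitive only through the origin-based rule, mirroring domain contextualization.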
Taxonomy
Topics: Data Quality and Management · Privacy-Preserving Technologies in Data · Privacy, Security, and Data Protection
