Detection of Personal Data in Structured Datasets Using a Large Language Model
Albert Agisha Ntwali, Luca Rück, Martin Heckmann

TL;DR
This paper introduces a new method using GPT-4o to detect personal data in structured datasets by incorporating contextual information, showing improved performance especially on real-world datasets compared to existing tools.
Contribution
The paper presents a novel approach leveraging GPT-4o with contextual information for personal data detection, outperforming existing methods on real-world datasets.
Findings
GPT-4o-based approach outperforms others on MIMIC-Demo-Ext
Contextual information improves detection in Kaggle and OpenML datasets
Performance varies significantly across datasets
Abstract
We propose a novel approach for detecting personal data in structured datasets, leveraging GPT-4o, a state-of-the-art Large Language Model. A key innovation of our method is the incorporation of contextual information: in addition to a feature's name and values, we utilize information from other feature names within the dataset as well as the dataset description. We compare our approach to alternative methods, including Microsoft Presidio and CASSED, evaluating them on multiple datasets: DeSSI, a large synthetic dataset, datasets we collected from Kaggle and OpenML as well as MIMIC-Demo-Ext, a real-world dataset containing patient information from critical care units. Our findings reveal that detection performance varies significantly depending on the dataset used for evaluation. CASSED excels on DeSSI, the dataset on which it was trained. Performance on the medical dataset…
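The contextual inputs named in the abstract (a feature's name and values, the other feature names in the dataset, and the dataset description) could be assembled into a single prompt roughly as follows. This is a minimal sketch, not the authors' implementation: the function name `build_detection_prompt`, the prompt wording, the `max_values` cap, and the example column names are all hypothetical, and no model call is made.

```python
from typing import Sequence


def build_detection_prompt(
    feature_name: str,
    sample_values: Sequence[object],
    other_feature_names: Sequence[str],
    dataset_description: str,
    max_values: int = 5,
) -> str:
    """Assemble one prompt asking whether a column holds personal data.

    Hypothetical helper: the wording and the value cap are illustrative
    assumptions, not taken from the paper.
    """
    # Show only a few values; the cap of 5 is an arbitrary illustrative choice.
    shown = ", ".join(repr(v) for v in sample_values[:max_values])
    # Context: every other column name in the same dataset.
    others = ", ".join(n for n in other_feature_names if n != feature_name)
    return (
        "Decide whether the following feature contains personal data.\n"
        f"Dataset description: {dataset_description}\n"
        f"Other features in the dataset: {others}\n"
        f"Feature name: {feature_name}\n"
        f"Sample values: {shown}\n"
        "Answer with 'personal' or 'non-personal'."
    )


# Made-up example column, for illustration only.
prompt = build_detection_prompt(
    feature_name="dob",
    sample_values=["1980-02-11", "1975-07-30"],
    other_feature_names=["patient_id", "dob", "admission_type"],
    dataset_description="Hypothetical table of hospital admissions.",
)
print(prompt)
```

The resulting string would then be sent to the language model once per feature. Keeping the dataset-level context separate from the feature-level fields makes it easy to compare runs with and without context, which is the contrast the abstract highlights.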
Taxonomy
Topics: Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education · Data Quality and Management
