"hasSignification()": une nouvelle fonction de distance pour soutenir la détection de données personnelles
Amine Mrabet, Ali Hassan, Patrice Darmon (Umanis)

TL;DR
This paper introduces a novel distance function called "hasSignification()" to improve automatic detection of personal data attributes by better assessing the meaningfulness of attribute names through enhanced string similarity measures.
Contribution
It proposes a new exponential-based distance function and a double dictionary scan method to better evaluate attribute name significance for data discovery.
Findings
The new distance function outperforms traditional methods like N-Gram, Jaro-Winkler, and Levenshtein.
The exponential scoring improves threshold setting for attribute validation.
Double dictionary scan effectively handles compound attribute names.
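The summary does not spell out how the double dictionary scan works. As one possible reading, and purely as an illustrative sketch, a compound attribute name could be handled by scanning the dictionary once for a word that starts the name and a second time for the remainder. The function name `split_compound`, the separator handling, and the toy dictionary are all assumptions, not the authors' implementation.

```python
def split_compound(name, dictionary):
    """Return (first, second) if `name` splits into two dictionary words, else None.

    Hypothetical interpretation of a "double dictionary scan":
    scan 1 finds a dictionary word that prefixes the name,
    scan 2 looks up what is left over.
    """
    name = name.lower()
    # First scan: dictionary words that start the name, longest first.
    prefixes = sorted(
        (w for w in dictionary if name.startswith(w) and w != name),
        key=len,
        reverse=True,
    )
    for first in prefixes:
        # Drop a separator such as "_" or "-" between the two parts (assumed convention).
        rest = name[len(first):].lstrip("_-")
        # Second scan: the remainder must itself be a dictionary word.
        if rest in dictionary:
            return first, rest
    return None


# Toy dictionary, for illustration only.
words = {"birth", "date", "customer", "name", "first"}
print(split_compound("birthdate", words))
print(split_compound("first_name", words))
print(split_compound("xqzt", words))
```

This exact-match version only illustrates the two-pass idea; the paper's approach relies on distance scores rather than exact lookups.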
Abstract
Today, with Big Data and data lakes, we face a mass of data that is very difficult to manage manually. Protecting personal data in this context requires automatic analysis for data discovery. Storing the names of attributes already analyzed in a knowledge base could optimize this automatic discovery. To keep the knowledge base useful, we should not store attributes whose names do not make sense. In this article, to check whether an attribute name has a meaning, we propose a solution that calculates the distances between this name and the words in a dictionary. Our studies of distance functions such as N-Gram, Jaro-Winkler and Levenshtein reveal limitations when setting an acceptance threshold for an attribute in the knowledge base. To overcome these limitations, our solution aims to strengthen the score calculation by using an exponential function based on the…
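The baseline idea described in the abstract, scoring an attribute name by its distance to dictionary words, can be sketched with a plain Levenshtein distance. This is a minimal illustration of the traditional approach whose thresholding the authors find limiting, not the proposed hasSignification() function; the normalization, the helper names `levenshtein` and `best_match`, and the toy dictionary are assumptions.

```python
def levenshtein(a, b):
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def best_match(name, dictionary):
    """Return (similarity, word) for the closest dictionary word.

    Similarity is 1 - distance / max(len), an assumed normalization
    that maps identical strings to 1.0 and unrelated ones toward 0.0.
    """
    name = name.lower()
    scored = []
    for word in dictionary:
        dist = levenshtein(name, word)
        scored.append((1 - dist / max(len(name), len(word), 1), word))
    return max(scored)


# Toy dictionary, for illustration only.
score, word = best_match("adress", ["address", "name"])
print(word, round(score, 3))
```

With a score like this, accepting or rejecting a name comes down to choosing a cut-off value, which is the step the abstract identifies as hard to set reliably.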
Taxonomy
Methods: Balanced Selection
