ClustEm4Ano: Clustering Text Embeddings of Nominal Textual Attributes for Microdata Anonymization
Robert Aufschläger, Sebastian Wilhelm, Michael Heigl, Martin Schramm

TL;DR
ClustEm4Ano is a novel anonymization pipeline that automatically generates semantic value generalization hierarchies for nominal textual data using embeddings and clustering, improving data privacy and utility.
Contribution
The paper introduces an automated method for creating value generalization hierarchies (VGHs) by clustering text embeddings, which supports the anonymization of nominal data.
Findings
The generated VGHs can outperform manual hierarchies in anonymization tasks.
The approach improves data utility for small values of k in k-anonymity.
Experimental validation on the Adult dataset confirms effectiveness.
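To make the role of k concrete, here is a minimal sketch of a k-anonymity check: a table is k-anonymous if every combination of quasi-identifier values occurs in at least k records. The records, attribute names, and the one-level generalization mapping below are hypothetical and chosen only for illustration; they are not taken from the paper or from the Adult dataset.

```python
from collections import Counter

# Hypothetical toy records: (education, marital_status) as quasi-identifiers.
records = [
    ("Bachelors", "Married"),
    ("Masters", "Married"),
    ("Doctorate", "Married"),
    ("HS-grad", "Single"),
    ("11th", "Single"),
    ("9th", "Single"),
]

# Hypothetical one-level generalization of the education attribute.
EDU_LEVEL1 = {
    "Bachelors": "higher-ed",
    "Masters": "higher-ed",
    "Doctorate": "higher-ed",
    "HS-grad": "school",
    "11th": "school",
    "9th": "school",
}


def k_of(rows):
    """Return the size of the smallest group of identical quasi-identifier tuples."""
    return min(Counter(rows).values())


raw_k = k_of(records)  # every raw record is unique
generalized = [(EDU_LEVEL1[edu], marital) for edu, marital in records]
gen_k = k_of(generalized)  # groups merge after generalization

print(raw_k, gen_k)
```

Generalizing one level up the hierarchy merges records into larger groups, which raises k at the cost of less specific values. The quality of the hierarchy determines how much detail is lost for a given k.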
Abstract
This work introduces ClustEm4Ano, an anonymization pipeline that can be used for generalization and suppression-based anonymization of nominal textual tabular data. It automatically generates value generalization hierarchies (VGHs) that, in turn, can be used to generalize attributes in quasi-identifiers. The pipeline leverages embeddings to generate semantically close value generalizations through iterative clustering. We applied KMeans and Hierarchical Agglomerative Clustering to different predefined text embeddings, both open-source and closed-source (the latter accessed via APIs). Our approach is experimentally tested on a well-known benchmark dataset for anonymization: the UCI Machine Learning Repository's Adult dataset. ClustEm4Ano supports anonymization procedures by offering more possibilities compared to using arbitrarily chosen VGHs. Experiments demonstrate that these VGHs can outperform manually…
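The idea of deriving a VGH by iteratively clustering embeddings can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the 2-D vectors are hypothetical stand-ins for real text embeddings, the merging rule is a simple centroid-linkage agglomerative step written in plain Python, and the names `build_vgh`, `merge_closest`, and `cluster_counts` are assumptions introduced here.

```python
import math
from itertools import combinations

# Hypothetical 2-D "embeddings"; a real pipeline would use text-embedding vectors.
EMBEDDINGS = {
    "Bachelors": (0.9, 0.1),
    "Masters": (1.0, 0.2),
    "Doctorate": (1.1, 0.15),
    "HS-grad": (0.1, 0.9),
    "11th": (0.0, 1.0),
    "9th": (0.05, 1.1),
}


def centroid(values):
    """Mean embedding of a set of attribute values."""
    points = [EMBEDDINGS[v] for v in values]
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def merge_closest(clusters):
    """One agglomerative step: merge the two clusters with the closest centroids."""
    i, j = min(
        combinations(range(len(clusters)), 2),
        key=lambda p: math.dist(centroid(clusters[p[0]]), centroid(clusters[p[1]])),
    )
    merged = clusters[i] | clusters[j]
    return [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]


def build_vgh(values, cluster_counts):
    """Return one partition per hierarchy level, from finest to coarsest."""
    clusters = [frozenset([v]) for v in values]
    levels = [clusters]
    for target in cluster_counts:
        while len(clusters) > target:
            clusters = merge_closest(clusters)
        levels.append(clusters)
    return levels


def generalize(value, level, levels):
    """Look up the cluster (generalized value) containing `value` at a level."""
    return next(c for c in levels[level] if value in c)


levels = build_vgh(sorted(EMBEDDINGS), cluster_counts=[2, 1])
for depth, partition in enumerate(levels):
    print(depth, [sorted(c) for c in partition])
```

Each level of the resulting list is a coarser partition of the attribute's values, so a value can be generalized by replacing it with the cluster that contains it at the chosen level. Here a cluster is represented simply by its set of members; how generalized values are labeled in practice is left open.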
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Computational and Text Analysis Methods
