SKALD: Scalable K-Anonymisation for Large Datasets

Kailash Reddy; Novoneel Chakraborty; Amogh Dharmavaram; Anshoo Tandon

arXiv:2505.03529 · cs.IT · July 2, 2025

Open Access

TL;DR

This paper introduces SKALD, a scalable k-anonymisation algorithm for datasets larger than available RAM. It processes the data in chunks and combines sufficient statistics extracted from each chunk, improving both performance and post-anonymisation data utility relative to standard k-anonymisation methods.

Contribution

SKALD is a novel algorithm for k-anonymising large datasets with limited RAM. It processes the data chunk by chunk and outperforms standard k-anonymisation methods, which can only operate on each chunk independently.

Findings

01. Multi-fold performance improvement over existing methods

02. Effective processing of datasets exceeding RAM limits

03. Enhanced data utility post-anonymisation

Abstract

Data privacy and anonymisation are critical concerns in today's data-driven society, particularly when handling personal and sensitive user data. Regulatory frameworks worldwide recommend privacy-preserving protocols such as k-anonymisation to de-identify releases of tabular data. Available hardware resources provide an upper bound on the maximum size of dataset that can be processed at a time. Large datasets with sizes exceeding this upper bound must be broken up into smaller data chunks for processing. In these cases, standard k-anonymisation tools such as ARX can only operate on a per-chunk basis. This paper proposes SKALD, a novel algorithm for performing k-anonymisation on large datasets with limited RAM. Our SKALD algorithm offers multi-fold performance improvement over standard k-anonymisation methods by extracting and combining sufficient statistics from each chunk during…
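The idea of combining per-chunk sufficient statistics can be illustrated with a small sketch. This is not the SKALD algorithm itself: the abstract is truncated before it says which statistics are extracted or how they are used. The sketch assumes a single numeric quasi-identifier (an age column), uses per-chunk histograms as the sufficient statistic, and picks the smallest bin width from a candidate list for which every occupied bin holds at least k records. All function names and parameters are hypothetical.

```python
from collections import Counter


def iter_chunks(values, chunk_size):
    """Yield consecutive slices, standing in for chunks read from disk."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def chunk_histogram(chunk, bin_width):
    """Count records per generalised bin; a bin is labelled by its lower edge."""
    return Counter((value // bin_width) * bin_width for value in chunk)


def merged_histogram(chunks, bin_width):
    """Sum per-chunk histograms; only one chunk is held in memory at a time."""
    total = Counter()
    for chunk in chunks:
        total.update(chunk_histogram(chunk, bin_width))
    return total


def smallest_k_anonymous_width(values, chunk_size, k, candidate_widths):
    """Return the first candidate width whose merged bins all hold >= k records.

    Assumption: candidate_widths is ordered from finest to coarsest.
    Returns None if no candidate satisfies the k-anonymity condition.
    """
    for width in candidate_widths:
        hist = merged_histogram(iter_chunks(values, chunk_size), width)
        if hist and min(hist.values()) >= k:
            return width
    return None


if __name__ == "__main__":
    ages = [21, 22, 23, 24, 31, 32, 33, 34, 35, 29]
    print(smallest_k_anonymous_width(ages, chunk_size=3, k=3,
                                     candidate_widths=[1, 5, 10]))
```

Because histogram counts add across chunks, the merged histogram equals the histogram of the full dataset, so the generalisation level is chosen from global counts even though no more than one chunk is ever loaded. Anonymising each chunk in isolation would have to satisfy the k threshold within the chunk alone, which generally forces coarser generalisation.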

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Cryptography and Data Security · Data Quality and Management