MultiGraSCCo: A Multilingual Anonymization Benchmark with Annotations of Personal Identifiers
Ibrahim Baroud, Christoph Otto, Vera Czehmann, Christine Hovhannisyan, Lisa Raithel, Sebastian Möller, Roland Roller

TL;DR
This paper introduces MultiGraSCCo, a multilingual anonymization benchmark with annotated personal identifiers across ten languages, created using machine translation to facilitate privacy-preserving medical data sharing and model training.
Contribution
It presents a novel multilingual anonymization benchmark with preserved annotations, generated via machine translation, to support privacy-compliant medical data analysis and model development.
Findings
High-quality translations confirmed by medical professionals
Over 2,500 annotated personal information instances included
Benchmark supports training, validation, and automatic detection improvements
Abstract
Accessing sensitive patient data for machine learning is challenging due to privacy concerns. Datasets with annotations of personally identifiable information are crucial for developing and testing anonymization systems to enable safe data sharing that complies with privacy regulations. Since accessing real patient data is a bottleneck, synthetic data offers an efficient solution for data scarcity, bypassing privacy regulations that apply to real data. Moreover, neural machine translation can help to create high-quality data for low-resource languages by translating validated real or synthetic data from a high-resource language. In this work, we create a multilingual anonymization benchmark in ten languages, using a machine translation methodology that preserves the original annotations and renders names of cities and people in a culturally and contextually appropriate form in each…
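The abstract mentions a translation methodology that preserves the original annotations, but this summary does not describe how it works. As an illustration only, the sketch below shows one common way to carry span annotations through translation: wrap each annotated span in numbered inline tags, translate the tagged text, then strip the tags and recompute the offsets. The tag format, the function names, and the `toy_translate` stand-in are all assumptions made for this example and are not taken from the paper.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start, end, label), end exclusive

# Matches a numbered tag pair such as <e0>...</e0> (hypothetical tag format).
TAG = re.compile(r"<e(\d+)>(.*?)</e\1>", re.DOTALL)


def mark_spans(text: str, spans: List[Span]) -> str:
    """Wrap each span in numbered tags. Assumes spans are sorted and non-overlapping."""
    out, cursor = [], 0
    for i, (start, end, _label) in enumerate(spans):
        out.append(text[cursor:start])
        out.append(f"<e{i}>{text[start:end]}</e{i}>")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


def unmark_spans(marked: str, labels: List[str]) -> Tuple[str, List[Span]]:
    """Remove the tags and return the clean text with recomputed span offsets."""
    out, spans, cursor, length = [], [], 0, 0
    for m in TAG.finditer(marked):
        before = marked[cursor:m.start()]
        inner = m.group(2)
        out.extend([before, inner])
        length += len(before)
        spans.append((length, length + len(inner), labels[int(m.group(1))]))
        length += len(inner)
        cursor = m.end()
    out.append(marked[cursor:])
    return "".join(out), spans


def translate_with_annotations(
    text: str, spans: List[Span], translate: Callable[[str], str]
) -> Tuple[str, List[Span]]:
    """Translate text while carrying the labeled spans over to the output."""
    ordered = sorted(spans)
    marked = mark_spans(text, ordered)
    translated = translate(marked)  # the translator must leave the tags intact
    return unmark_spans(translated, [label for _, _, label in ordered])


def toy_translate(marked: str) -> str:
    """Hypothetical stand-in for a real MT system; it only rewrites one phrase."""
    return marked.replace("lives in", "wohnt in")


text = "Anna Meyer lives in Berlin."
spans = [(0, 10, "NAME"), (20, 26, "LOCATION")]
new_text, new_spans = translate_with_annotations(text, spans, toy_translate)
```

A real translation system may drop, reorder, or alter such tags, so a practical pipeline would need to check that every tag survives, and would need a separate step to substitute locale-appropriate names of people and cities as the abstract describes.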
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education
