Transforming Sensitive Documents into Quantitative Data: An AI-Based Preprocessing Toolchain for Structured and Privacy-Conscious Analysis
Anders Ledberg, Anna Thalén

TL;DR
This paper introduces a privacy-preserving, modular AI-based preprocessing toolchain that standardizes, anonymizes, and transforms unstructured sensitive text data into structured embeddings, facilitating large-scale analysis in privacy-sensitive fields.
Contribution
The authors develop a toolchain built entirely on open-weight models that run on local hardware. It standardizes, anonymizes, and converts sensitive unstructured text into embeddings, enabling scalable, privacy-conscious research.
Findings
Effective removal of personally identifiable information
High semantic content retention after anonymization
Successful application to Swedish court decision corpus
Abstract
Unstructured text from legal, medical, and administrative sources offers a rich but underutilized resource for research in public health and the social sciences. However, large-scale analysis is hampered by two key challenges: the presence of sensitive, personally identifiable information, and significant heterogeneity in structure and language. We present a modular toolchain that prepares such text data for embedding-based analysis, relying entirely on open-weight models that run on local hardware, requiring only a workstation-level GPU and supporting privacy-sensitive research. The toolchain employs large language model (LLM) prompting to standardize, summarize, and, when needed, translate texts to English for greater comparability. Anonymization is achieved via LLM-based redaction, supplemented with named entity recognition and rule-based methods to minimize the risk of disclosure.…
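As a rough illustration of the rule-based layer that the abstract says supplements LLM-based redaction, the sketch below replaces two kinds of identifiers with placeholder tags. The patterns, the tag names, and the `redact` function are assumptions made for this example and are not taken from the paper; a real pipeline would combine such rules with the LLM and named entity recognition steps described above.

```python
import re

# Hypothetical patterns chosen for illustration only; the paper's actual
# rules are not given in the abstract.
PATTERNS = {
    # Swedish personal identity number: YYMMDD-XXXX or YYYYMMDD-XXXX,
    # with an optional "-" or "+" separator.
    "ID_NUMBER": re.compile(r"\b(?:\d{8}|\d{6})[-+]?\d{4}\b"),
    # Simple e-mail address pattern (not a full RFC 5322 matcher).
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def redact(text: str) -> str:
    """Replace every match of each pattern with a bracketed placeholder tag."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    print(redact("Contact anna@example.se, id 850101-1234."))
```

Rules like these only catch identifiers with a predictable surface form. Names, addresses, and indirect identifiers need the model-based steps, which is presumably why the toolchain layers the methods rather than relying on any single one.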
Taxonomy
Topics: Privacy-Preserving Technologies in Data
