AspirinSum: an Aspect-based utility-preserved de-identification Summarization framework
Ya-Lun Li

TL;DR
AspirinSum is a domain-adaptable, aspect-based summarization framework that de-identifies sensitive text data by extracting and replacing personal aspects, facilitating privacy-compliant data sharing for training large language models.
Contribution
It introduces a novel, domain-adaptable de-identification method that preserves utility by focusing on personal aspects, reducing reliance on human annotation.
Findings
Efficiently summarizes and de-identifies sensitive text data.
Leverages expert knowledge without additional human annotation.
Enables privacy-compliant data sharing for domain-specific LLM training.
Abstract
Due to the rapid advancement of Large Language Models (LLMs), the whole community eagerly consumes any available text data to train them. Currently, a large portion of the available text data is collected from the internet, which has been regarded as a cheap source of training data. However, when people try to extend LLMs' capabilities to personal-related domains, such as healthcare or education, the lack of public datasets in these domains makes the adaptation of LLMs to such domains much slower. Publicly available datasets are lacking in these domains because the data usually contain sensitive personal information. To comply with privacy law, the data in such domains need to be de-identified before any kind of dissemination. Much research has tried to address this problem for image or tabular data. However, there was limited research on the…
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Metabolomics and Mass Spectrometry Studies · Topic Modeling
Methods: ALIGN
