PHICON: Improving Generalization of Clinical Text De-identification Models via Data Augmentation
Xiang Yue, Shuang Zhou

TL;DR
This paper introduces PHICON, a data augmentation technique that improves the generalization of clinical text de-identification models by enhancing training data with entity and context variations, leading to better cross-dataset performance.
Contribution
PHICON is a novel data augmentation method combining PHI and context augmentation to enhance model generalization in clinical text de-identification tasks.
Findings
PHICON improves F1-score by up to 8.6% on cross-dataset tests.
Both the amount of augmentation and the choice of augmentation method influence model performance.
PHICON enhances generalization across different clinical datasets.
Abstract
De-identification is the task of identifying protected health information (PHI) in clinical text. Existing neural de-identification models often fail to generalize to a new dataset. We propose a simple yet effective data augmentation method, PHICON, to alleviate this generalization issue. PHICON consists of PHI augmentation and Context augmentation, which create augmented training corpora by replacing PHI entities with named entities sampled from external sources and by changing the background context with synonym replacement or random word insertion, respectively. Experimental results on the i2b2 2006 and 2014 de-identification challenge datasets show that PHICON can help three selected de-identification models boost F1-score (by up to 8.6%) in the cross-dataset test setting. We also discuss how much augmentation to use and how each augmentation method influences performance.
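The two augmentation steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the BIO label scheme, the tiny `EXTERNAL_ENTITIES` and `SYNONYMS` tables, the function names, and the probabilities `p_replace` and `p_insert` are all assumptions made for illustration. In practice the entity lists would come from external sources and the synonyms from a lexical resource. Random insertion is simplified here to inserting a synonym directly after the word it was drawn from.

```python
import random

# Hypothetical stand-ins for external resources; contents are invented.
EXTERNAL_ENTITIES = {
    "PATIENT": ["Alice Moreno", "David Kim"],
    "HOSPITAL": ["Lakeside Medical Center", "Northgate Clinic"],
}
SYNONYMS = {
    "admitted": ["hospitalized"],
    "pain": ["ache", "discomfort"],
}


def phi_augment(tokens, labels, rng):
    """Replace each BIO-labelled PHI span with a sampled surrogate entity."""
    out_tokens, out_labels = [], []
    i = 0
    while i < len(tokens):
        label = labels[i]
        if label.startswith("B-"):
            phi_type = label[2:]
            # Find the end of the current PHI span.
            j = i + 1
            while j < len(tokens) and labels[j] == "I-" + phi_type:
                j += 1
            candidates = EXTERNAL_ENTITIES.get(phi_type)
            # Keep the original span if no surrogate list exists for this type.
            new_span = rng.choice(candidates).split() if candidates else tokens[i:j]
            out_tokens.extend(new_span)
            out_labels.extend(
                ["B-" + phi_type] + ["I-" + phi_type] * (len(new_span) - 1)
            )
            i = j
        else:
            out_tokens.append(tokens[i])
            out_labels.append(label)
            i += 1
    return out_tokens, out_labels


def context_augment(tokens, labels, rng, p_replace=0.3, p_insert=0.1):
    """Vary non-PHI tokens via synonym replacement and word insertion."""
    out_tokens, out_labels = [], []
    for token, label in zip(tokens, labels):
        options = SYNONYMS.get(token.lower())
        editable = label == "O" and bool(options)
        # Synonym replacement: swap a context word for one of its synonyms.
        if editable and rng.random() < p_replace:
            token = rng.choice(options)
        out_tokens.append(token)
        out_labels.append(label)
        # Simplified random insertion: add a synonym right after the word.
        if editable and rng.random() < p_insert:
            out_tokens.append(rng.choice(options))
            out_labels.append("O")
    return out_tokens, out_labels


tokens = ["Mr.", "John", "Smith", "was", "admitted", "to",
          "Mercy", "Hospital", "with", "pain"]
labels = ["O", "B-PATIENT", "I-PATIENT", "O", "O", "O",
          "B-HOSPITAL", "I-HOSPITAL", "O", "O"]

rng = random.Random(0)
aug_tokens, aug_labels = phi_augment(tokens, labels, rng)
aug_tokens, aug_labels = context_augment(aug_tokens, aug_labels, rng)
print(list(zip(aug_tokens, aug_labels)))
```

Only tokens labelled `O` are touched by the context step, so PHI spans and their labels stay aligned. The augmented sentences would then be added to the original training corpus rather than replacing it.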
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare
