TL;DR
REPRINT is a simple, effective data augmentation method that uses principal component extrapolation in hidden space to improve imbalanced text classification, with stable results and low computational cost.
Contribution
It introduces a novel hidden-space extrapolation technique for data augmentation that enhances class balance and model robustness in NLP tasks.
Findings
REPRINT outperforms existing augmentation methods on four text classification benchmarks.
Label refinement improves the quality of augmented data.
The method is robust across different principal component choices.
Abstract
Data scarcity and data imbalance have attracted considerable attention in many fields. Data augmentation, explored as an effective approach to both problems, can improve the robustness and efficiency of classification models by generating new samples. This paper presents REPRINT, a simple and effective hidden-space data augmentation method for imbalanced data classification. Given hidden-space representations of the samples in each class, REPRINT extrapolates, in a randomized fashion, augmented examples for the target class, using subspaces spanned by principal components to summarize the distribution structure of both the source and target classes. Consequently, the generated examples diversify the target class while maintaining the original geometry of the target distribution. In addition, the method includes a label refinement component that synthesizes new soft labels for the augmented examples. Compared…
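The idea described in the abstract can be illustrated with a rough sketch. This is not the paper's algorithm: the abstract does not give the exact extrapolation or label refinement formulas, so everything below is an assumption made for illustration. The function names (`principal_subspace`, `extrapolate`, `soft_labels`), the random scale range, and the `max_leak` soft-label rule are all hypothetical. The sketch expresses a source-class sample in the source class's principal-component coordinates, rescales it randomly, and re-expresses it around the target-class mean along the target class's principal components.

```python
import numpy as np

def principal_subspace(X, k):
    """Return the class mean, top-k principal directions (rows), and their std devs."""
    mean = X.mean(axis=0)
    # SVD of the centered data; rows of vt are orthonormal principal directions.
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    std = s[:k] / np.sqrt(max(len(X) - 1, 1))
    return mean, vt[:k], std

def extrapolate(source, target, k=2, n_new=5, seed=0):
    """Hypothetical extrapolation: map source variation into the target subspace."""
    rng = np.random.default_rng(seed)
    mu_s, comps_s, std_s = principal_subspace(source, k)
    mu_t, comps_t, std_t = principal_subspace(target, k)
    # Randomly chosen source samples supply the variation to transfer.
    idx = rng.integers(0, len(source), size=n_new)
    # Standardized coordinates of those samples in the source subspace.
    coords = (source[idx] - mu_s) @ comps_s.T / (std_s + 1e-12)
    # Random scale per new example (the range is an assumption).
    lam = rng.uniform(0.5, 1.0, size=(n_new, 1))
    # Rescale to the target's spread and place around the target mean.
    new = mu_t + (lam * coords * std_t) @ comps_t
    return new, lam

def soft_labels(lam, source_label, target_label, n_classes, max_leak=0.2):
    """Placeholder soft labels: leak a small mass to the source class."""
    leak = max_leak * lam.ravel()
    labels = np.zeros((len(leak), n_classes))
    labels[:, target_label] = 1.0 - leak
    labels[:, source_label] = leak
    return labels
```

With a majority class of 30 hidden vectors and a minority class of 6, `extrapolate(source, target)` returns five new minority-class vectors that lie in the affine subspace through the target mean spanned by the target's top principal components. That is one way to read the abstract's claim that generated examples keep the target geometry while adding diversity.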
