Reprint: a randomized extrapolation based on principal components for data augmentation

Le Li; Jiale Wei; Pai Peng; Qiyuan Chen; Benjamin Guedj; Bo Cai

arXiv:2204.12024 · cs.CL · December 11, 2024

TL;DR

REPRINT is a simple, effective data augmentation method that uses principal component extrapolation in hidden space to improve imbalanced text classification, with stable results and low computational cost.

Contribution

It introduces a novel hidden-space extrapolation technique for data augmentation that enhances class balance and model robustness in NLP tasks.

Findings

01. REPRINT outperforms existing augmentation methods on four text classification benchmarks.

02. Label refinement improves the quality of augmented data.

03. The method is robust across different principal component choices.

Abstract

Data scarcity and data imbalance have attracted considerable attention in many fields. Data augmentation, explored as an effective approach to tackle them, can improve the robustness and efficiency of classification models by generating new samples. This paper presents REPRINT, a simple and effective hidden-space data augmentation method for imbalanced data classification. Given hidden-space representations of samples in each class, REPRINT extrapolates, in a randomized fashion, augmented examples for the target class, using subspaces spanned by principal components to summarize the distribution structure of both the source and target classes. Consequently, the generated examples diversify the target class while maintaining the original geometry of the target distribution. In addition, the method includes a label refinement component that makes it possible to synthesize new soft labels for augmented examples. Compared…
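To make the idea in the abstract more concrete, here is a minimal NumPy sketch of one possible reading of "randomized extrapolation using principal-component subspaces". It is not the paper's algorithm: the function names (`principal_subspace`, `extrapolate`), the parameters `k` and `lam_max`, the specific update rule (project a source deviation into the target's principal subspace and add a randomly scaled copy to a target sample), and the soft-label formula are all assumptions made for illustration.

```python
import numpy as np

def principal_subspace(X, k):
    """Return the mean of X and its top-k principal directions (as rows)."""
    mu = X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]

def extrapolate(source, target, k=2, lam_max=0.5, n_new=10, rng=None):
    """Hypothetical sketch: create n_new augmented target-class vectors.

    source, target: (n_samples, dim) hidden representations of two classes.
    Returns (augmented, soft_labels); soft_labels columns are [source, target].
    """
    rng = np.random.default_rng(rng)
    mu_s, V_s = principal_subspace(source, k)
    _, V_t = principal_subspace(target, k)

    # Randomly pair source and target samples.
    s = source[rng.integers(len(source), size=n_new)]
    t = target[rng.integers(len(target), size=n_new)]

    # Source deviation, summarized by the source principal subspace.
    delta = (s - mu_s) @ V_s.T @ V_s
    # Assumption: keep only the part lying in the target principal subspace,
    # so the perturbation follows the target class's main directions.
    delta = delta @ V_t.T @ V_t

    # Random extrapolation strength per sample.
    lam = rng.uniform(0.0, lam_max, size=(n_new, 1))
    augmented = t + lam * delta

    # Assumption: a simple soft label that shifts mass toward the source
    # class in proportion to the extrapolation strength.
    w = lam / (1.0 + lam)
    soft_labels = np.hstack([w, 1.0 - w])
    return augmented, soft_labels
```

With, say, 50 source vectors and 10 target vectors of dimension 8, `extrapolate(source, target, k=2, n_new=5)` returns a `(5, 8)` array of new target-class representations and a `(5, 2)` array of soft labels whose rows sum to 1. The official repository listed on this page is the place to check the method's actual formulation.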

Code & Models

Repositories

bigdata-ccnu/reprint (Official)