Self-Guided Noise-Free Data Generation for Efficient Zero-Shot Learning
Jiahui Gao, Renjie Pi, Yong Lin, Hang Xu, Jiacheng Ye, Zhiyong Wu, Weizhong Zhang, Xiaodan Liang, Zhenguo Li, Lingpeng Kong

TL;DR
This paper introduces SunGen, a noise-robust re-weighting framework that automatically selects high-quality samples from synthetic data generated by large language models, significantly improving zero-shot classification performance without manual tuning.
Contribution
The paper presents SunGen, a novel automatic data filtering method that enhances zero-shot learning by re-weighting synthetic data without human intervention.
Findings
SunGen improves average accuracy by 9.8% across eight tasks.
The method effectively filters low-quality synthetic samples.
Theoretical and empirical analysis confirms data quality improvements.
Abstract
There is a rising interest in further exploring the zero-shot learning potential of large pre-trained language models (PLMs). A new paradigm called data-generation-based zero-shot learning has achieved impressive success. In this paradigm, the synthesized data from the PLM acts as the carrier of knowledge, which is used to train a task-specific model with orders of magnitude fewer parameters than the PLM, achieving both higher performance and efficiency than prompt-based zero-shot learning methods on PLMs. The main hurdle of this approach is that the synthesized data from the PLM usually contains a significant portion of low-quality samples. Fitting on such data will greatly hamper the performance of the task-specific model, making it unreliable for deployment. Previous methods remedy this issue mainly by filtering synthetic data using heuristic metrics (e.g., output confidence), or refining…
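The re-weighting idea mentioned in the summary can be illustrated with a small sketch. This is not SunGen's actual algorithm, which the text above does not spell out; it only shows, under assumed toy values, how per-sample weights change a training loss so that a suspected noisy synthetic sample contributes less. The function name `weighted_cross_entropy`, the toy probabilities, and the weight values are all hypothetical.

```python
import math


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of per-sample cross-entropy losses.

    probs   -- one list of class probabilities per sample
    labels  -- integer class labels (possibly noisy synthetic labels)
    weights -- non-negative per-sample weights (hypothetical values;
               how such weights are learned is outside this sketch)
    """
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")
    loss = 0.0
    for p, y, w in zip(probs, labels, weights):
        # Clamp to avoid log(0) on degenerate predictions.
        loss += w * -math.log(max(p[y], 1e-12))
    return loss / total_weight


# Toy data (assumed): the model is fairly confident in class 0 for all
# three samples, but the third synthetic label says class 1.
probs = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15]]
labels = [0, 0, 1]

uniform = weighted_cross_entropy(probs, labels, [1.0, 1.0, 1.0])
reweighted = weighted_cross_entropy(probs, labels, [1.0, 1.0, 0.1])
print(uniform, reweighted)
```

With uniform weights the suspect third sample dominates the loss; giving it a small weight reduces its influence without discarding it outright.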
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · COVID-19 diagnosis using AI
