SYNAPSE-G: Bridging Large Language Models and Graph Learning for Rare Event Classification
Sasan Tavakkol, Lin Chen, Max Springer, Abigail Schantz, Blaž Bratanič, Vincent Cohen-Addad, MohammadHossein Bateni

TL;DR
SYNAPSE-G is a pipeline that uses large language models to generate synthetic seed examples and semi-supervised label propagation to find rare positives in unlabeled data, addressing the scarcity of labeled data for rare event classification.
Contribution
The paper presents SYNAPSE-G, a new method combining LLM-generated synthetic data with graph-based label propagation for rare event classification.
Findings
Outperforms baseline methods such as nearest neighbor search.
Effectively identifies positive examples in imbalanced datasets.
Theoretically analyzes how the quality (validity and diversity) of the synthetic data affects precision and recall.
Abstract
Scarcity of labeled data, especially for rare events, hinders training effective machine learning models. This paper proposes SYNAPSE-G (Synthetic Augmentation for Positive Sampling via Expansion on Graphs), a novel pipeline leveraging Large Language Models (LLMs) to generate synthetic training data for rare event classification, addressing the cold-start problem. The synthetic data serve as seeds for semi-supervised label propagation on a similarity graph constructed between the seeds and a large unlabeled dataset. This step identifies candidate positive examples, which are subsequently labeled by an oracle (human or LLM). The expanded dataset is then used to train or fine-tune a classifier. We theoretically analyze how the quality (validity and diversity) of the synthetic data impacts the precision and recall of our method. Experiments on the imbalanced SST2 and MHS datasets demonstrate SYNAPSE-G's…
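The seed-and-propagate step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `propagate_from_seeds`, the cosine k-nearest-neighbor graph, the damping factor `alpha`, and the iteration count are all assumptions chosen for illustration, and the embeddings are assumed to be precomputed vectors for the synthetic seeds and the unlabeled pool.

```python
import numpy as np

def propagate_from_seeds(seed_emb, pool_emb, k=2, alpha=0.8, n_iter=20):
    """Score unlabeled pool items by spreading positive labels from seeds.

    Hypothetical helper: builds a cosine kNN graph over seeds + pool and
    runs a damped label-propagation iteration anchored at the seeds.
    """
    X = np.vstack([seed_emb, pool_emb]).astype(float)
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit norm -> dot = cosine
    S = X @ X.T
    np.fill_diagonal(S, -np.inf)                    # exclude self-edges

    n = len(X)
    nbrs = np.argsort(-S, axis=1)[:, :k]            # k most similar nodes per row
    rows = np.repeat(np.arange(n), k)
    cols = nbrs.ravel()
    W = np.zeros((n, n))
    W[rows, cols] = np.clip(S[rows, cols], 0.0, None)  # keep non-negative weights
    W = np.maximum(W, W.T)                          # make the graph undirected

    deg = W.sum(axis=1, keepdims=True)
    P = W / np.where(deg > 0, deg, 1.0)             # row-normalized transitions

    y0 = np.zeros(n)
    y0[: len(seed_emb)] = 1.0                       # seeds are assumed positive
    f = y0.copy()
    for _ in range(n_iter):
        f = alpha * (P @ f) + (1 - alpha) * y0      # propagate, then re-anchor
    return f[len(seed_emb):]                        # scores for pool items only
```

In a pipeline of this shape, the highest-scoring pool items (for example `np.argsort(-scores)[:m]` for some review budget `m`) would be the candidates sent to the oracle for labeling; how the paper selects and thresholds candidates is not specified in this summary.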
