SNaRe: Domain-aware Data Generation for Low-Resource Event Detection
Tanmay Parekh, Yuxuan Dong, Lucas Bandarkar, Artin Kim, I-Hung Hsu, Kai-Wei Chang, Nanyun Peng

TL;DR
SNaRe is a domain-aware synthetic data generation framework that improves event detection in specialized domains by reducing label noise and domain drift, yielding higher F1 in low-resource settings.
Contribution
The paper introduces SNaRe, a framework with three components (Scout, Narrator, and Refiner) that generates domain-specific training data for event detection while addressing label noise and domain drift.
Findings
Outperforms baselines with 3-7% F1 gains in zero/few-shot settings
Achieves 4-20% F1 improvement in multilingual generation
Human evaluation confirms higher annotation quality
Abstract
Event Detection (ED) -- the task of identifying event mentions from natural language text -- is critical for enabling reasoning in highly specialized domains such as biomedicine, law, and epidemiology. Data generation has proven to be effective in broadening its utility to wider applications without requiring expensive expert annotations. However, when existing generation approaches are applied to specialized domains, they struggle with label noise, where annotations are incorrect, and domain drift, characterized by a distributional mismatch between generated sentences and the target domain. To address these issues, we introduce SNaRe, a domain-aware synthetic data generation framework composed of three components: Scout, Narrator, and Refiner. Scout extracts triggers from unlabeled target domain data and curates a high-quality domain-specific trigger list using corpus-level statistics…
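The abstract says only that Scout curates its trigger list "using corpus-level statistics"; it does not say which statistics. As an illustration, here is a minimal sketch that assumes the statistic is sentence frequency: a candidate trigger is kept if it was extracted from at least `min_sentences` sentences. The function name `curate_triggers`, its parameters, and the input format are hypothetical and are not taken from the paper.

```python
from collections import Counter


def curate_triggers(extracted, min_sentences=2, top_k=50):
    """Keep frequently extracted triggers per event type.

    extracted: one list per sentence, each holding (event_type, trigger)
    pairs proposed for that sentence (hypothetical input format).
    """
    counts = Counter()
    for sentence_pairs in extracted:
        # Count each (event type, lowercased trigger) once per sentence,
        # so the statistic is sentence frequency rather than raw count.
        unique_pairs = {(etype, trig.lower()) for etype, trig in sentence_pairs}
        counts.update(unique_pairs)

    curated = {}
    for (etype, trig), n in counts.most_common():
        if n < min_sentences:
            continue  # assumed noise filter: drop rarely seen triggers
        bucket = curated.setdefault(etype, [])
        if len(bucket) < top_k:
            bucket.append(trig)
    return curated
```

Under this assumption, a trigger extracted from only a single sentence is treated as likely noise and dropped. A real system might weight candidates differently.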
Taxonomy
Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Text Readability and Simplification
