GuideX: Guided Synthetic Data Generation for Zero-Shot Information Extraction
Neil De La Fuente, Oscar Sainz, Iker García-Ferrero, Eneko Agirre

TL;DR
GUIDEX is a novel method that automatically generates synthetic data and schemas to improve zero-shot information extraction, achieving state-of-the-art results without human-labeled data.
Contribution
The paper introduces GUIDEX, a new approach for automatic schema definition and synthetic data generation to enhance zero-shot IE performance.
Findings
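Fine-tuning Llama 3.1 with GUIDEX sets a new state of the art across seven zero-shot NER benchmarks.
Models trained with GUIDEX gain up to 7 F1 points over previous methods without human-labeled data, and nearly 2 F1 points when combined with it.
Models trained on GUIDEX show enhanced comprehension of complex, domain-specific schemas.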
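To make the "F1 points" in these findings concrete, here is a minimal sketch of entity-level micro F1 with exact span and label matching, reported on a 0 to 100 scale. The tuple layout and the `micro_f1` helper are assumptions made for illustration; this summary does not state the paper's exact evaluation protocol.

```python
from typing import Iterable, Set, Tuple

# Hypothetical entity layout: (sentence_id, start, end, label).
# A prediction counts as correct only if all four fields match a gold entity.
Entity = Tuple[int, int, int, str]


def micro_f1(gold: Iterable[Entity], predicted: Iterable[Entity]) -> float:
    """Return micro-averaged F1 in [0, 1] over exact entity matches."""
    gold_set: Set[Entity] = set(gold)
    pred_set: Set[Entity] = set(predicted)
    true_positives = len(gold_set & pred_set)
    if true_positives == 0:
        # Covers empty inputs and avoids division by zero.
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Invented toy data: the second prediction has the right span but the wrong label.
gold = [(0, 0, 2, "PERSON"), (0, 5, 6, "LOCATION"), (1, 3, 4, "ORGANIZATION")]
pred = [(0, 0, 2, "PERSON"), (0, 5, 6, "ORGANIZATION"), (1, 3, 4, "ORGANIZATION")]

score = micro_f1(gold, pred)
print(f"F1 = {100 * score:.1f}")  # prints "F1 = 66.7"
```

On this scale, a gain of 7 F1 points corresponds to moving from, for example, 60.0 to 67.0.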
Abstract
Information Extraction (IE) systems are traditionally domain-specific, requiring costly adaptation that involves expert schema design, data annotation, and model training. While Large Language Models have shown promise in zero-shot IE, performance degrades significantly in unseen domains where label definitions differ. This paper introduces GUIDEX, a novel method that automatically defines domain-specific schemas, infers guidelines, and generates synthetically labeled instances, allowing for better out-of-domain generalization. Fine-tuning Llama 3.1 with GUIDEX sets a new state-of-the-art across seven zero-shot Named Entity Recognition benchmarks. Models trained with GUIDEX gain up to 7 F1 points over previous methods without human-labeled data, and nearly 2 F1 points when combined with it. Models trained on GUIDEX demonstrate enhanced comprehension of complex, domain-specific…
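The abstract names three artifacts: a domain-specific schema, guidelines for its labels, and synthetically labeled instances. The sketch below shows one way such artifacts could be represented and sanity-checked. Every class, field, and function name here is hypothetical, the example data is invented, and nothing in it is taken from the GUIDEX implementation.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class EntityType:
    name: str       # label name, e.g. "Medication"
    guideline: str  # natural-language definition of what the label covers


@dataclass
class Schema:
    domain: str
    entity_types: List[EntityType]

    def labels(self) -> Set[str]:
        return {entity_type.name for entity_type in self.entity_types}


@dataclass
class Span:
    text: str   # surface form of the annotated entity
    label: str  # must be one of the schema's label names


@dataclass
class LabeledInstance:
    text: str
    spans: List[Span] = field(default_factory=list)


def is_consistent(instance: LabeledInstance, schema: Schema) -> bool:
    """Check that every span uses a schema label and occurs in the text."""
    allowed = schema.labels()
    return all(
        span.label in allowed and span.text in instance.text
        for span in instance.spans
    )


# Invented example for illustration only.
schema = Schema(
    domain="clinical notes",
    entity_types=[
        EntityType("Medication", "A drug or substance given to a patient."),
        EntityType("Symptom", "A condition reported or observed in a patient."),
    ],
)
good = LabeledInstance(
    text="The patient took ibuprofen for a headache.",
    spans=[Span("ibuprofen", "Medication"), Span("headache", "Symptom")],
)
bad = LabeledInstance(
    text="The patient took ibuprofen for a headache.",
    spans=[Span("ibuprofen", "Dosage")],  # label not defined in the schema
)

print(is_consistent(good, schema), is_consistent(bad, schema))
```

A check of this kind is one plausible filter for automatically generated annotations, since an instance whose labels fall outside the schema cannot be used with the schema's guidelines.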
Taxonomy
Topics: Web Data Mining and Analysis
