IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus

Honghao Gui; Lin Yuan; Hongbin Ye; Ningyu Zhang; Mengshu Sun; Lei Liang; Huajun Chen

arXiv:2402.14710·cs.CL·May 28, 2024·6 cites

TL;DR

IEPile is a large-scale bilingual instruction corpus designed to improve information extraction capabilities of LLMs, addressing the limitations of existing datasets by providing standardized schema-based data.

Contribution

The paper introduces IEPile, a comprehensive bilingual IE dataset with schema-based instructions, significantly expanding data scale and quality for LLM training.

Findings

01 Enhanced LLM performance in IE tasks
02 Improved zero-shot generalization capabilities
03 Open-sourced dataset and models for community use

Abstract

Large Language Models (LLMs) demonstrate remarkable potential across various domains; however, they exhibit a significant performance gap in Information Extraction (IE). High-quality instruction data is key to enhancing the specific capabilities of LLMs, yet current IE datasets tend to be small in scale and fragmented, and they lack a standardized schema. To this end, we introduce IEPile, a comprehensive bilingual (English and Chinese) IE instruction corpus containing approximately 0.32B tokens. We construct IEPile by collecting and cleaning 33 existing IE datasets, and we introduce schema-based instruction generation to unearth a large-scale corpus. Experimentally, IEPile enhances the performance of LLMs on IE, with notable improvements in zero-shot generalization. We open-source the resource and pre-trained models, hoping to provide valuable support to the NLP community.
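
The abstract names schema-based instruction generation without giving the format. As a rough illustration only, the sketch below shows one way an annotated sentence could be paired with an explicit label schema to form an instruction/output record. The function name `build_instruction`, the field names, the prompt wording, and the sample sentence are all assumptions made for this example and are not taken from the paper or its repository.

```python
import json


def build_instruction(text, schema, annotations, task="named entity"):
    """Build one instruction record from a text and a label schema.

    Hypothetical format: the field names and prompt wording here are
    assumptions for illustration, not the format used by IEPile.
    """
    prompt = {
        "instruction": (
            f"Extract every {task} in the input that matches a label "
            "in the schema. Return an empty list for unmatched labels."
        ),
        "schema": schema,
        "input": text,
    }
    # Every schema label appears in the target, even with no match,
    # so the model also learns to answer "nothing found".
    target = {label: [] for label in schema}
    for span, label in annotations:
        if label in target:
            target[label].append(span)
    return {
        "instruction": json.dumps(prompt, ensure_ascii=False),
        "output": json.dumps(target, ensure_ascii=False),
    }


example = build_instruction(
    text="Alice Smith joined Acme Corp.",
    schema=["person", "organization", "location"],
    annotations=[("Alice Smith", "person"), ("Acme Corp", "organization")],
)
print(example["output"])
```

Placing the schema inside the prompt means the same sentence can be reused with different label sets, which is one plausible reading of how a fixed set of source datasets could yield a much larger instruction corpus.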

Code & Models

Repositories

zjunlp/iepile (PyTorch, official)

Models

Datasets

zjunlp/iepile
dataset · 1.5k downloads

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques