HistRED: A Historical Document-Level Relation Extraction Dataset

Soyoung Yang; Minseok Choi; Youngwoo Cho; Jaegul Choo

arXiv:2307.04285·cs.CL·July 11, 2023

TL;DR

HistRED is a new bilingual dataset for historical document-level relation extraction, enabling research on Korean and Hanja texts with diverse context lengths, and demonstrating improved RE performance using multi-language information.

Contribution

We introduce HistRED, a novel dataset for historical relation extraction with bilingual annotations and variable text lengths, and propose a model leveraging both languages for better accuracy.

Findings

1. Our bilingual model outperforms monolingual baselines.
2. HistRED supports diverse context lengths for robust evaluation.
3. The dataset is publicly available for research use.

Abstract

Despite the extensive applications of relation extraction (RE) in various domains, little has been explored in the historical context, which contains promising data across hundreds and thousands of years. To promote historical RE research, we present HistRED, constructed from Yeonhaengnok. Yeonhaengnok is a collection of records originally written in Hanja, classical Chinese writing, and later translated into Korean. HistRED provides bilingual annotations so that RE can be performed on both Korean and Hanja texts. In addition, HistRED provides self-contained subtexts of different lengths, from sentence level to document level, giving researchers diverse context settings in which to evaluate the robustness of their RE models. To demonstrate the usefulness of our dataset, we propose a bilingual RE model that leverages both Korean and Hanja contexts to…

Code & Models

Repositories

https://huggingface.co/datasets/Soyoung/HistRED
Official

Datasets

Soyoung/HistRED
dataset · 53 downloads
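
For readers who want to inspect the data, the following is a minimal sketch of fetching the dataset listed above with the Hugging Face `datasets` library. The helper name `load_histred` is hypothetical, and the sketch assumes the repository loads with its default configuration; depending on the library version and how the repository is packaged, extra arguments may be needed. Nothing here is taken from the paper itself.

```python
REPO_ID = "Soyoung/HistRED"  # dataset id as listed on this page


def load_histred(split=None):
    """Fetch HistRED from the Hugging Face Hub (hypothetical helper; needs network access)."""
    # Imported lazily so that defining this helper does not require `datasets` to be installed.
    from datasets import load_dataset

    # Assumption: the repository loads with its default configuration.
    # With split=None, load_dataset returns every available split.
    return load_dataset(REPO_ID, split=split)
```

Calling `load_histred()` and printing the result shows the available splits and column names, which is worth checking before relying on any particular field.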

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Biomedical Text Mining and Ontologies