An Annotated Corpus of Webtables for Information Extraction Tasks

Erin Macdonald; Denilson Barbosa

arXiv:2008.07680 · cs.IR · November 17, 2020 · 1 citation

Open Access

TL;DR

This paper presents a large, annotated dataset of Wikipedia tables with 28 relations, enabling improved relation extraction from structured web data and providing a new benchmark for future research.

Contribution

It introduces an annotation framework and a dataset of 217,834 Wikipedia tables with relation annotations, addressing the lack of a standard benchmark for table-based information extraction.

Findings

01 Achieved 94% annotation accuracy using classifiers.

02 Created the first publicly available large-scale table dataset with relation annotations.

03 Facilitated future research in table-based relation extraction.

Abstract

Information Extraction is a well-researched area of Natural Language Processing, with applications in web search and question answering. It is concerned with identifying entities and the relationships between them as expressed in a given context, usually a sentence or a paragraph of running text. Given the importance of the task, several datasets and benchmarks have been curated over the years. However, focusing on running text alone leaves out tables, which are common in many structured documents and in which pairs of entities also co-occur in context (e.g., the same row of the table). While there are recent papers on relation extraction from tables in the literature, their experimental evaluations have been on ad-hoc datasets for lack of a standard benchmark. This paper helps close that gap. We introduce an annotation framework and a dataset of 217,834 tables from Wikipedia which are annotated…
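The idea that entity pairs co-occur in the same row of a table can be illustrated with a minimal Python sketch. This is not the authors' annotation framework: the function name `row_entity_pairs`, the column headers, and the sample rows are all hypothetical, chosen only to show how candidate pairs for relation annotation might be enumerated from a table.

```python
from itertools import combinations


def row_entity_pairs(header, rows):
    """Yield (subject_column, object_column, subject, object) for every
    pair of non-empty cells that share a row.

    Hypothetical helper for illustration; not part of the paper's framework.
    """
    for row in rows:
        # Every unordered pair of columns is a candidate relation context.
        for i, j in combinations(range(len(header)), 2):
            subj, obj = row[i].strip(), row[j].strip()
            if subj and obj:  # skip pairs with an empty cell
                yield header[i], header[j], subj, obj


# Invented sample table, not taken from the dataset.
header = ["Film", "Director", "Year"]
rows = [
    ["Alien", "Ridley Scott", "1979"],
    ["Heat", "Michael Mann", ""],
]

pairs = list(row_entity_pairs(header, rows))
for pair in pairs:
    print(pair)
```

Each emitted tuple is only a candidate: a relation classifier or annotator would still need to decide which relation, if any, holds between the two cells.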

Taxonomy

Topics: Topic Modeling · Data Quality and Management · Natural Language Processing Techniques