Split-Correctness in Information Extraction

Johannes Doleschal, Benny Kimelfeld, Wim Martens, Frank Neven, and Matthias Niewerth

arXiv:1810.03367·cs.DB·May 21, 2021

Open Access

TL;DR

This paper introduces a formal framework for split-correctness in information extraction: detecting whether an extractor gives the same result when it runs separately on segments of a document. Knowing this enables parallel and incremental processing of large documents.

Contribution

It formalizes split-correctness within document spanners, analyzes its complexity for regular spanners, and explores variants involving black-box extractors with split constraints.

Findings

1. Split-correctness can be formally characterized within the spanner framework.

2. The complexity of checking split-correctness varies with spanner type.

3. Variants with black-box extractors introduce additional challenges.

Abstract

Programs for extracting structured information from text, namely information extractors, often operate separately on document segments obtained from a generic splitting operation such as sentences, paragraphs, k-grams, HTTP requests, and so on. An automated detection of this behavior of extractors, which we refer to as split-correctness, would allow text analysis systems to devise query plans with parallel evaluation on segments for accelerating the processing of large documents. Other applications include the incremental evaluation on dynamic content, where re-evaluation of information extractors can be restricted to revised segments, and debugging, where developers of information extractors are informed about potential boundary crossing of different semantic components. We propose a new formal framework for split-correctness within the formalism of document spanners. Our analysis…
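The idea in the abstract can be illustrated on a single document with a small sketch. It is not the paper's formalism or decision procedure: the names `sentence_splitter`, `make_extractor`, `evaluate_on_segments`, and `agrees_on` are hypothetical, the splitter and extractors are toy regular expressions, and Python's `re.finditer` returns only non-overlapping leftmost matches, which is an assumption made for brevity. The check compares running an extractor on the whole document with running the same extractor on each segment and shifting the resulting spans back to document offsets.

```python
import re
from typing import Callable, List, Set, Tuple

Span = Tuple[int, int]  # half-open [start, end) character offsets


def sentence_splitter(doc: str) -> List[Span]:
    """Toy splitter (assumption): segments are maximal runs without '.'."""
    return [(m.start(), m.end()) for m in re.finditer(r"[^.]+", doc)]


def make_extractor(pattern: str) -> Callable[[str], Set[Span]]:
    """Build a toy extractor that returns the spans matched by a regex."""
    rx = re.compile(pattern)

    def extract(text: str) -> Set[Span]:
        return {(m.start(), m.end()) for m in rx.finditer(text)}

    return extract


def evaluate_on_segments(
    extract: Callable[[str], Set[Span]],
    splitter: Callable[[str], List[Span]],
    doc: str,
) -> Set[Span]:
    """Run the extractor on each segment; shift spans to document offsets."""
    result: Set[Span] = set()
    for start, end in splitter(doc):
        for s, e in extract(doc[start:end]):
            result.add((start + s, start + e))
    return result


def agrees_on(
    extract: Callable[[str], Set[Span]],
    splitter: Callable[[str], List[Span]],
    doc: str,
) -> bool:
    """True if whole-document and per-segment evaluation coincide on doc."""
    return extract(doc) == evaluate_on_segments(extract, splitter, doc)


if __name__ == "__main__":
    doc = "ab cd. ef gh."
    words = make_extractor(r"[a-z]+")            # never crosses a '.'
    crossing = make_extractor(r"[a-z]+\. [a-z]+")  # spans a sentence boundary
    print(agrees_on(words, sentence_splitter, doc))     # True
    print(agrees_on(crossing, sentence_splitter, doc))  # False
```

Agreement on one document is only a spot check. The property discussed in the abstract concerns the extractor's behavior in general, so a disagreement on some document refutes it, while agreement on a few samples does not establish it.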

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Web Data Mining and Analysis