Split-Correctness in Information Extraction
Johannes Doleschal, Benny Kimelfeld, Wim Martens, Frank Neven, and Matthias Niewerth

TL;DR
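This paper introduces a formal framework for split-correctness in information extraction: the property that an extractor can be evaluated separately on the segments produced by a generic splitter (sentences, paragraphs, and so on) without changing its results. Detecting this property automatically enables parallel and incremental processing of large documents.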
Contribution
It formalizes split-correctness within document spanners, analyzes its complexity for regular spanners, and explores variants involving black-box extractors with split constraints.
Findings
Split-correctness can be formally characterized within the spanner framework.
The complexity of checking split-correctness varies with spanner type.
Variants with black-box extractors introduce additional challenges.
Abstract
Programs for extracting structured information from text, namely information extractors, often operate separately on document segments obtained from a generic splitting operation such as sentences, paragraphs, k-grams, HTTP requests, and so on. An automated detection of this behavior of extractors, which we refer to as split-correctness, would allow text analysis systems to devise query plans with parallel evaluation on segments for accelerating the processing of large documents. Other applications include the incremental evaluation on dynamic content, where re-evaluation of information extractors can be restricted to revised segments, and debugging, where developers of information extractors are informed about potential boundary crossing of different semantic components. We propose a new formal framework for split-correctness within the formalism of document spanners. Our analysis…
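To make the idea in the abstract concrete, here is a minimal Python sketch that tests, on one sample document, whether an extractor returns the same spans when run on the whole text as when run on each segment separately. It is an illustration only, not the paper's method: the paper studies the static decision problem over all documents within the document-spanner formalism, whereas this checks a single input. All names (`sentence_splitter`, `digit_extractor`, `crossing_extractor`, `agrees_on_segments`) and the regex-based extractors are assumptions made for the example.

```python
import re
from typing import Callable, List, Set, Tuple

Span = Tuple[int, int]  # half-open [start, end) character offsets


def sentence_splitter(doc: str) -> List[Span]:
    """Hypothetical splitter: each segment is a run of non-period
    characters plus its trailing period, if any."""
    return [m.span() for m in re.finditer(r"[^.]+\.?", doc)]


def digit_extractor(doc: str) -> Set[Span]:
    """Hypothetical extractor: spans of maximal digit runs."""
    return {m.span() for m in re.finditer(r"\d+", doc)}


def crossing_extractor(doc: str) -> Set[Span]:
    """Hypothetical extractor whose matches straddle a sentence boundary."""
    return {m.span() for m in re.finditer(r"\w+\. \w+", doc)}


def agrees_on_segments(
    extract: Callable[[str], Set[Span]],
    split: Callable[[str], List[Span]],
    doc: str,
) -> bool:
    """Return True if extracting from the whole document yields the same
    spans as extracting from each segment and shifting the results back
    to document offsets. This checks one document only."""
    whole = extract(doc)
    pieces: Set[Span] = set()
    for seg_start, seg_end in split(doc):
        for s, e in extract(doc[seg_start:seg_end]):
            pieces.add((seg_start + s, seg_start + e))
    return whole == pieces


DOC = "Call 555 now. Room 42 is free."

if __name__ == "__main__":
    print(agrees_on_segments(digit_extractor, sentence_splitter, DOC))
    print(agrees_on_segments(crossing_extractor, sentence_splitter, DOC))
```

On the sample text, the digit extractor agrees with its per-sentence evaluation, while the boundary-crossing extractor does not, because its only match spans two sentences. A single-document check like this can refute the property for a given input but cannot establish it for all documents.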
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Web Data Mining and Analysis
