Constant delay algorithms for regular document spanners

Fernando Florenzano; Cristian Riveros; Martin Ugarte; Stijn Vansummeren; Domagoj Vrgoc

arXiv:1803.05277 · cs.DB · March 15, 2018

TL;DR

This paper introduces a practical constant delay enumeration algorithm for regular document spanners. After a precomputation phase that is linear in the document, it outputs the extracted data in quick succession, and it can be applied to several automata-based and regex-based spanner formalisms.

Contribution

The paper presents a new constant delay enumeration algorithm for regular document spanners, with a precomputation phase that is linear in the document, and analyzes how the algorithm applies to different spanner formalisms.

Findings

01 The algorithm achieves constant delay enumeration after a precomputation phase that is linear in the document.

02 The algorithm applies to several automata-based and regex-based spanner formalisms.

03 The paper provides a complexity analysis of counting the outputs of a spanner.

Abstract

Regular expressions and automata models with capture variables are core tools in rule-based information extraction. These formalisms, also called regular document spanners, use regular languages in order to locate the data that a user wants to extract from a text document, and then store this data into variables. Since document spanners can easily generate large outputs, it is important to have good evaluation algorithms that can generate the extracted data in a quick succession, and with relatively little precomputation time. Towards this goal, we present a practical evaluation algorithm that allows constant delay enumeration of a spanner's output after a precomputation phase that is linear in the document. While the algorithm assumes that the spanner is specified in a syntactic variant of variable set automata, we also study how it can be applied when the spanner is specified by…
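The two phases described in the abstract can be illustrated with a minimal sketch. It assumes a small hand-built automaton that is deterministic and in which every accepting run opens and closes each variable exactly once. It also assumes two separate transition tables, one for letters and one for sets of variable markers. The names `Cell`, `cells`, `precompute`, `enumerate_outputs`, and `to_spans`, the table layout, and the 0-indexed half-open spans are all illustrative assumptions, not taken from the paper. The precomputation pass builds a DAG of partial outputs while reading the document once, and the enumeration pass walks that DAG backwards.

```python
class Cell:
    """One cell of a singly linked list of DAG nodes."""
    __slots__ = ("node", "next")

    def __init__(self, node):
        self.node, self.next = node, None


BOTTOM = None  # sink node of the DAG: the empty partial output


def cells(lst):
    """Yield the nodes of a list given as (start, end), stopping at `end`
    even if a later concatenation linked more cells after it."""
    cell, end = lst
    while True:
        yield cell.node
        if cell is end:
            return
        cell = cell.next


def precompute(doc, letter, capture, initial):
    """Single left-to-right pass; returns one list of DAG nodes per live state."""
    first = Cell(BOTTOM)
    lists = {initial: (first, first)}
    for i in range(len(doc) + 1):
        # Capturing step: follow marker-set transitions from the states that
        # were live before this step (the snapshot keeps their old lists).
        for q, preds in list(lists.items()):
            for markers, target in capture.get(q, ()):
                cell = Cell((markers, i, preds))
                if target in lists:  # prepend in O(1)
                    start, end = lists[target]
                    cell.next = start
                    lists[target] = (cell, end)
                else:
                    lists[target] = (cell, cell)
        if i == len(doc):
            break
        # Reading step: move every list along its letter transition.
        # Determinism sends each list to at most one target, so lists can be
        # concatenated destructively in O(1).
        new = {}
        for q, (start, end) in lists.items():
            target = letter.get((q, doc[i]))
            if target is None:
                continue
            if target in new:
                s, e = new[target]
                e.next = start
                new[target] = (s, end)
            else:
                new[target] = (start, end)
        lists = new
    return lists


def enumerate_outputs(lists, finals):
    """Walk the DAG from the final states back to BOTTOM, one output per path."""
    def walk(node, suffix):
        if node is BOTTOM:
            yield suffix
            return
        markers, pos, preds = node
        for pred in cells(preds):
            yield from walk(pred, ((markers, pos),) + suffix)

    for q in finals:
        if q in lists:
            for node in cells(lists[q]):
                yield from walk(node, ())


def to_spans(output):
    """Turn a sequence of (markers, position) pairs into {variable: (start, end)}."""
    opened, spans = {}, {}
    for markers, pos in output:
        for var, kind in sorted(markers, key=lambda m: m[1] != "open"):
            if kind == "open":
                opened[var] = pos
            else:
                spans[var] = (opened[var], pos)
    return spans


# Hypothetical automaton: x captures any non-empty run of "a" in a text over {a, b}.
letter = {(0, "a"): 0, (0, "b"): 0, (1, "a"): 2, (2, "a"): 2, (3, "a"): 3, (3, "b"): 3}
capture = {0: [((("x", "open"),), 1)], 2: [((("x", "close"),), 3)]}

lists = precompute("aab", letter, capture, initial=0)
results = [to_spans(out) for out in enumerate_outputs(lists, finals={3})]
print(sorted(r["x"] for r in results))  # [(0, 1), (0, 2), (1, 2)]
```

In this sketch each step of the pass does work that depends only on the automaton, not on the position in the document, and the depth of every walk is bounded by the number of variable markers. That is the intuition behind linear precomputation and constant delay. The sketch does not attempt to cover non-deterministic automata or regex formulas.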
