A framework for extraction and transformation of documents
Cristian Riveros, Markus L. Schmid, Nicole Schweikardt

TL;DR
This paper introduces a theoretical framework combining document spanners and polyregular functions for efficient extraction and transformation of text documents, with practical extensions and evaluation algorithms.
Contribution
The paper extends document spanners from span-tuples to multispan-tuples, studies linear ET programs, and gives an algorithm that evaluates them with linear-time preprocessing and constant delay.
Findings
Linear ET programs are as expressive as nondeterministic streaming string transducers under bag semantics.
Linear ET programs are closed under composition.
The enumeration algorithm for linear ET programs achieves linear-time preprocessing and constant delay.
Abstract
We present a theoretical framework for the extraction and transformation of text documents. We propose to use a two-phase process where the first phase extracts span-tuples from a document, and the second phase maps the content of the span-tuples into new documents. We base the extraction phase on the framework of document spanners and the transformation phase on the theory of polyregular functions, the class of regular string-to-string functions with polynomial growth. For supporting practical extract-transform scenarios, we propose an extension of document spanners described by regex formulas from span-tuples to so-called multispan-tuples, where variables are mapped to sets of spans. We prove that this extension, called regex multispanners, has the same desirable properties as standard spanners described by regex formulas. In our framework, an Extract-Transform (ET) program is given…
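The two-phase process described in the abstract can be illustrated with a minimal sketch. This is not the paper's formalism: it uses Python's `re` module as a stand-in for a regex spanner and a plain function as a stand-in for the transformation. The pattern, the variable names `key` and `val`, and the function names `extract`, `transform`, and `run_et` are all hypothetical and chosen only for illustration.

```python
import re
from typing import Dict, Iterator, List, Tuple

Span = Tuple[int, int]  # half-open interval [start, end) of positions in the document


def extract(doc: str) -> Iterator[Dict[str, Span]]:
    # Phase 1 (extraction): yield one span-tuple per match.
    # The pattern and the variable names "key" and "val" are hypothetical.
    pattern = re.compile(r"(?P<key>\w+): (?P<val>\w+)")
    for match in pattern.finditer(doc):
        yield {"key": match.span("key"), "val": match.span("val")}


def transform(doc: str, spans: Dict[str, Span]) -> str:
    # Phase 2 (transformation): build a new document from the span contents.
    key_start, key_end = spans["key"]
    val_start, val_end = spans["val"]
    return doc[val_start:val_end] + "=" + doc[key_start:key_end]


def run_et(doc: str) -> List[str]:
    # Compose the two phases: one output document per extracted span-tuple.
    return [transform(doc, spans) for spans in extract(doc)]
```

For example, `run_et("a: 1\nb: 2")` returns `["1=a", "2=b"]`. Note that `re.finditer` reports only non-overlapping leftmost matches, whereas a document spanner returns every span-tuple that satisfies the formula, so this sketch only approximates the extraction phase.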
