A framework for extraction and transformation of documents

Cristian Riveros; Markus L. Schmid; Nicole Schweikardt

arXiv:2405.12350·cs.DB·May 22, 2024

TL;DR

This paper introduces a theoretical framework that combines document spanners (for extraction) with polyregular functions (for transformation) of text documents. It extends regex-formula spanners to multispan-tuples and gives an enumeration algorithm for evaluating linear ET programs.

Contribution

It extends document spanners with multispan-tuples, studies linear ET programs, and provides algorithms for their evaluation with linear preprocessing and constant delay.

Findings

01. Linear ET programs are as expressive as nondeterministic streaming string transducers.

02. Linear ET programs are closed under composition.

03. The enumeration algorithm for linear ET programs needs linear-time preprocessing of the document and then outputs results with constant delay.

Abstract

We present a theoretical framework for the extraction and transformation of text documents. We propose to use a two-phase process where the first phase extracts span-tuples from a document, and the second phase maps the content of the span-tuples into new documents. We base the extraction phase on the framework of document spanners and the transformation phase on the theory of polyregular functions, the class of regular string-to-string functions with polynomial growth. For supporting practical extract-transform scenarios, we propose an extension of document spanners described by regex formulas from span-tuples to so-called multispan-tuples, where variables are mapped to sets of spans. We prove that this extension, called regex multispanners, has the same desirable properties as standard spanners described by regex formulas. In our framework, an Extract-Transform (ET) program is given…
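The two-phase process described above can be illustrated with a minimal sketch. It is not the paper's formalism: it assumes Python's `re` module as a stand-in for regex formulas, 0-indexed half-open `(start, end)` pairs as spans, and a format string as a stand-in for the transformation function. The names `extract`, `transform`, and `extract_multispan` are hypothetical. Note also that `re.finditer` returns only non-overlapping leftmost matches, which is a simplification of spanner semantics.

```python
import re

def extract(pattern, doc):
    """Phase 1: return one span-tuple per match, as a dict from
    variable name (named group) to a (start, end) span."""
    return [
        {var: m.span(var) for var in m.groupdict()}
        for m in re.finditer(pattern, doc)
    ]

def transform(doc, span_tuple, template):
    """Phase 2: map the content of the spans into a new document.
    The template is an illustrative stand-in for the transformation."""
    contents = {var: doc[s:e] for var, (s, e) in span_tuple.items()}
    return template.format(**contents)

def extract_multispan(record_pattern, item_pattern, doc):
    """Illustrative multispan-tuples: 'name' maps to a single span,
    'item' maps to a set of spans found inside the 'items' group."""
    item_re = re.compile(item_pattern)
    result = []
    for m in re.finditer(record_pattern, doc):
        start, end = m.span("items")
        # finditer with pos/endpos reports spans relative to the whole doc.
        items = {i.span() for i in item_re.finditer(doc, start, end)}
        result.append({"name": {m.span("name")}, "item": items})
    return result

# Span-tuples: each variable is mapped to exactly one span.
doc = "Ann:ann@x.org;Bob:bob@y.org;"
pattern = r"(?P<name>[A-Za-z]+):(?P<mail>[^;]+);"
tuples = extract(pattern, doc)
outputs = [transform(doc, t, "{mail} <{name}>") for t in tuples]

# Multispan-tuples: the variable 'item' is mapped to a set of spans.
doc2 = "Ann:a1,a2;Bob:b1;"
multi = extract_multispan(
    r"(?P<name>[A-Za-z]+):(?P<items>[^;]*);", r"[^,]+", doc2
)
```

Here `outputs` holds one new document per extracted span-tuple, and `multi` shows how a single variable can carry several spans per tuple, which a plain named group cannot express because a repeated group keeps only its last capture.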
