Joining Extractions of Regular Expressions

Dominik D. Freydenberger; Benny Kimelfeld; Liat Peterfreund

arXiv:1703.10350 · cs.DB · March 31, 2017 · 1 cites

TL;DR

This paper explores the computational complexity of querying text with regex formulas and relational algebra, revealing both hardness results and conditions under which efficient evaluation is possible.

Contribution

The paper extends complexity results for relational queries to regex-based information extraction, establishing new hardness results as well as polynomial-delay evaluation methods.

Findings

01 NP-completeness and W[1]-hardness hold even for single-character text.

02 Acyclic CQs, and even gamma-acyclic CQs, are hard to evaluate in this setting, unlike in the relational world.

03 UCQs can be evaluated with polynomial delay when every CQ has a bounded number of atoms.

Abstract

Regular expressions with capture variables, also known as "regex formulas," extract relations of spans (interval positions) from text. These relations can be further manipulated via Relational Algebra as studied in the context of document spanners, Fagin et al.'s formal framework for information extraction. We investigate the complexity of querying text by Conjunctive Queries (CQs) and Unions of CQs (UCQs) on top of regex formulas. We show that the lower bounds (NP-completeness and W[1]-hardness) from the relational world also hold in our setting; in particular, hardness hits already single-character text! Yet, the upper bounds from the relational world do not carry over. Unlike the relational world, acyclic CQs, and even gamma-acyclic CQs, are hard to compute. The source of hardness is that it may be intractable to instantiate the relation defined by a regex formula, simply because it…
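
To make the notions of span relations and joins concrete, here is a minimal sketch in Python. It is a brute-force illustration only, not the evaluation method of the paper. The helper names `adjacent_pairs` and `natural_join` are hypothetical, and spans are written as 0-based half-open `(start, end)` pairs following Python slicing, which is an assumption of this sketch rather than the paper's notation. The extractor enumerates every match of a formula of the shape `.* x{left} y{right} .*`, instead of relying on the leftmost, non-overlapping matches that `re.finditer` would return.

```python
import re


def adjacent_pairs(doc, left, right, x, y):
    """Evaluate a formula of the shape  .* x{left} y{right} .*  on doc.

    Returns one dict per match, mapping each variable name to a span
    (start, end). Brute force over all split points, so it is only
    meant for tiny inputs.
    """
    left_rx, right_rx = re.compile(left), re.compile(right)
    n = len(doc)
    matches = []
    for i in range(n + 1):
        for j in range(i, n + 1):
            if not left_rx.fullmatch(doc[i:j]):
                continue
            for k in range(j, n + 1):
                if right_rx.fullmatch(doc[j:k]):
                    matches.append({x: (i, j), y: (j, k)})
    return matches


def natural_join(r, s):
    """Natural join of two span relations given as lists of dicts.

    Two tuples are combined when they agree on every shared variable.
    Agreement means equal spans (positions), not equal substrings.
    """
    joined = []
    for t in r:
        for u in s:
            shared = t.keys() & u.keys()
            if all(t[v] == u[v] for v in shared):
                joined.append({**t, **u})
    return joined


doc = "abcabc"
R = adjacent_pairs(doc, "a", "b", "x", "y")  # x is an "a", directly followed by y, a "b"
S = adjacent_pairs(doc, "b", "c", "y", "z")  # y is a "b", directly followed by z, a "c"
joined = natural_join(R, S)
for t in joined:
    print(t)
```

On this document the join produces two tuples, one per occurrence of "abc". The two occurrences of "b" are equal as strings but are different spans, so they do not join with each other.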

Taxonomy

Topics: Advanced Database Systems and Queries · Algorithms and Data Compression · Natural Language Processing Techniques