Detecting Opportunities for Differential Maintenance of Extracted Views
Besat Kassaie, Frank Wm. Tompa

TL;DR
This paper formalizes the problem of keeping views extracted from source documents consistent as those documents are updated, and proposes algorithms for it. The approach is inspired by relational view maintenance and is applied to document spanners.
Contribution
It introduces a formal classification of document updates and algorithms for detecting pseudo-irrelevant updates in the context of information extraction using document spanners.
Findings
Classifies document updates into irrelevant, autonomously computable, and pseudo-irrelevant.
Provides algorithms to detect pseudo-irrelevant updates.
Extends view maintenance concepts to semi-structured data extraction.
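To make the setting concrete, the following is a minimal sketch of what an update can do to an extracted view. It is not the paper's algorithm: a Python regular expression stands in for a document spanner, and the function names (`extract`, `apply_update`, `compare_views`) and the outcome labels `unchanged`, `shifted`, and `changed` are hypothetical, chosen only for illustration. They are not the paper's definitions of its three update classes. The sketch also re-runs the extractor on the updated document, which is the brute-force baseline rather than a differential method.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # half-open [start, end) character offsets


def extract(pattern: str, doc: str) -> List[Span]:
    """Return the spans matched by `pattern`; a regex stands in for a spanner."""
    return [m.span() for m in re.finditer(pattern, doc)]


def apply_update(doc: str, start: int, end: int, replacement: str) -> str:
    """Replace doc[start:end] with `replacement`."""
    return doc[:start] + replacement + doc[end:]


def compare_views(pattern: str, doc: str, start: int, end: int,
                  replacement: str) -> str:
    """Compare the extracted view before and after one update.

    Illustrative labels only (assumptions, not the paper's terminology):
      "unchanged": same spans at the same offsets
      "shifted":   same extracted strings, different offsets
      "changed":   the extracted strings themselves differ
    """
    new_doc = apply_update(doc, start, end, replacement)
    old_spans = extract(pattern, doc)
    new_spans = extract(pattern, new_doc)
    if old_spans == new_spans:
        return "unchanged"
    old_text = [doc[s:e] for s, e in old_spans]
    new_text = [new_doc[s:e] for s, e in new_spans]
    if old_text == new_text:
        return "shifted"
    return "changed"


doc = "call 555-1234 or 555-9876 today"
phone = r"\d{3}-\d{4}"

print(compare_views(phone, doc, 26, 31, "now"))         # edit after the last match
print(compare_views(phone, doc, 0, 4, "please call"))   # edit before the first match
print(compare_views(phone, doc, 5, 13, "nobody"))       # edit overwrites a match
```

The comparison here describes a single document and a single update. Detecting such cases without re-extracting, and for arbitrary documents, is the harder question the paper addresses.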
Abstract
Semi-structured and unstructured data management is challenging, but many of the problems encountered are analogous to problems already addressed in the relational context. In the area of information extraction, for example, the shift from engineering ad hoc, application-specific extraction rules towards using expressive languages such as CPSL and AQL creates opportunities to propose solutions that can be applied to a wide range of extraction programs. In this work, we focus on extracted view maintenance, a problem that is well-motivated and thoroughly addressed in the relational setting. In particular, we formalize and address the problem of keeping extracted relations consistent with source documents that can be arbitrarily updated. We formally characterize three classes of document updates, namely those that are irrelevant, autonomously computable, and pseudo-irrelevant with respect…
Taxonomy
Topics: Web Data Mining and Analysis · Service-Oriented Architecture and Web Services · Software Engineering Research
