PROBER: Ad-Hoc Debugging of Extraction and Integration Pipelines

Anish Das Sarma; Alpa Jain; Philip Bohannon

arXiv:1004.1614 · cs.DB · April 12, 2010 · 2 cites

Open Access

TL;DR

PROBER is a system that enables interactive debugging and provenance analysis of complex, heterogeneous information extraction pipelines, improving post-mortem analysis efficiency for large-scale web data extraction.

Contribution

The paper introduces a generic debugger and provenance model for IE pipelines, allowing users to analyze diverse operators without first writing detailed specifications of each one.

Findings

01. Successfully applied to large-scale web data extraction
02. Improved debugging efficiency for complex pipelines
03. Effective provenance inference across diverse operators

Abstract

Complex information extraction (IE) pipelines assembled by plumbing together off-the-shelf operators, specially customized operators, and operators re-used from other text processing pipelines are becoming an integral component of most text processing frameworks. A critical task faced by the IE pipeline user is to run a post-mortem analysis on the output. Due to the diverse nature of extraction operators (often implemented by independent groups), it is time-consuming and error-prone to describe operator semantics formally or operationally to a provenance system. We introduce the first system that helps IE users analyze pipeline semantics and infer provenance interactively while debugging. This allows the effort to be proportional to the need, and to focus on the portions of the pipeline under the greatest suspicion. We present a generic debugger for running post-execution analysis of…

Taxonomy

Topics: Scientific Computing and Data Management · Web Data Mining and Analysis · Data Quality and Management