PROBER: Ad-Hoc Debugging of Extraction and Integration Pipelines
Anish Das Sarma, Alpa Jain, Philip Bohannon

TL;DR
PROBER is a system that enables interactive debugging and provenance analysis of complex, heterogeneous information extraction pipelines, improving post-mortem analysis efficiency for large-scale web data extraction.
Contribution
It introduces a generic debugger and provenance model for IE pipelines, allowing effective analysis of diverse operators without requiring detailed specifications.
Findings
Successfully applied to large-scale web data extraction
Improved debugging efficiency for complex pipelines
Effective provenance inference across diverse operators
Abstract
Complex information extraction (IE) pipelines assembled by plumbing together off-the-shelf operators, specially customized operators, and operators re-used from other text processing pipelines are becoming an integral component of most text processing frameworks. A critical task faced by the IE pipeline user is to run a post-mortem analysis on the output. Due to the diverse nature of extraction operators (often implemented by independent groups), it is time-consuming and error-prone to describe operator semantics formally or operationally to a provenance system. We introduce the first system that helps IE users analyze pipeline semantics and infer provenance interactively while debugging. This allows the effort to be proportional to the need, and to focus on the portions of the pipeline under the greatest suspicion. We present a generic debugger for running post-execution analysis of…
Taxonomy
Topics: Scientific Computing and Data Management · Web Data Mining and Analysis · Data Quality and Management
