Relational Deep Dive: Error-Aware Queries Over Unstructured Data
Daren Chao, Kaiwen Chen, Naiqing Guan, Nick Koudas

TL;DR
ReDD is a framework that discovers query-specific schemas and extracts data from unstructured sources with error guarantees, cutting extraction errors from 30% to below 1% and enabling accurate analytical queries.
Contribution
ReDD introduces a two-stage schema discovery and data population framework with error-aware guarantees, including the SCAPE method for calibrated error detection.
Findings
Reduces extraction errors from 30% to below 1%.
Achieves 100% schema recall and high precision.
Enables accuracy-cost trade-offs for high-stakes queries.
Abstract
Unstructured data is pervasive, but analytical queries demand structured representations, creating a significant extraction challenge. Existing methods like RAG lack schema awareness and struggle with cross-document alignment, leading to high error rates. We propose ReDD (Relational Deep Dive), a framework that dynamically discovers query-specific schemas, populates relational tables, and ensures error-aware extraction with provable guarantees. ReDD features a two-stage pipeline: (1) Iterative Schema Discovery (ISD) identifies minimal, joinable schemas tailored to each query, and (2) Tabular Data Population (TDP) extracts and corrects data using lightweight classifiers trained on LLM hidden states. A main contribution of ReDD is SCAPE, a statistically calibrated method for error detection with coverage guarantees, and SCAPE-HYB, a hybrid approach that optimizes the trade-off between…
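The abstract describes SCAPE only as a statistically calibrated error detector with coverage guarantees, and its details are cut off above. As a rough illustration of what such a calibration step can look like, here is a minimal sketch of a generic split-conformal threshold. It is not the authors' SCAPE method. The function names, the score convention (higher means more likely wrong), and the sample numbers are all assumptions made for this example.

```python
import math


def calibrate_threshold(error_scores, alpha):
    """Pick a flagging threshold from held-out cells known to be wrong.

    error_scores: detector scores for calibration cells whose extracted
        value is incorrect (hypothetical input; higher = more suspicious).
    alpha: tolerated fraction of true errors that may go unflagged.

    If new erroneous cells are exchangeable with the calibration ones,
    a new error scores below the k-th smallest calibration score with
    probability at most k / (n + 1) <= alpha.
    """
    n = len(error_scores)
    k = math.floor(alpha * (n + 1))
    if k < 1:
        # Too few calibration errors for this alpha: flag everything.
        return float("-inf")
    return sorted(error_scores)[k - 1]


def flag_for_review(scores, threshold):
    """Mark each extracted cell whose score reaches the threshold."""
    return [s >= threshold for s in scores]


# Illustrative numbers only, not results from the paper.
calibration_error_scores = [0.15, 0.35, 0.55, 0.6, 0.7, 0.8, 0.9]
threshold = calibrate_threshold(calibration_error_scores, alpha=0.25)
flags = flag_for_review([0.2, 0.35, 0.5], threshold)
```

Lowering `alpha` pushes the threshold down, so more cells are sent for correction. That is one simple way an accuracy-cost trade-off of the kind listed under Findings could arise.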
Taxonomy
Topics: Data Quality and Management · Advanced Database Systems and Queries · Web Data Mining and Analysis
