Relational Deep Dive: Error-Aware Queries Over Unstructured Data

Daren Chao; Kaiwen Chen; Naiqing Guan; Nick Koudas

arXiv:2511.02711 · cs.DB · November 5, 2025

Open Access

TL;DR

ReDD is a novel framework that dynamically discovers schemas and extracts data from unstructured sources with error guarantees, significantly reducing extraction errors and enabling accurate analytical queries.

Contribution

ReDD introduces a two-stage schema discovery and data population framework with error-aware guarantees, including the SCAPE method for calibrated error detection.

Findings

01 Reduces extraction errors from 30% to below 1%.

02 Achieves 100% schema recall and high precision.

03 Enables accuracy-cost trade-offs for high-stakes queries.

Abstract

Unstructured data is pervasive, but analytical queries demand structured representations, creating a significant extraction challenge. Existing methods like RAG lack schema awareness and struggle with cross-document alignment, leading to high error rates. We propose ReDD (Relational Deep Dive), a framework that dynamically discovers query-specific schemas, populates relational tables, and ensures error-aware extraction with provable guarantees. ReDD features a two-stage pipeline: (1) Iterative Schema Discovery (ISD) identifies minimal, joinable schemas tailored to each query, and (2) Tabular Data Population (TDP) extracts and corrects data using lightweight classifiers trained on LLM hidden states. A main contribution of ReDD is SCAPE, a statistically calibrated method for error detection with coverage guarantees, and SCAPE-HYB, a hybrid approach that optimizes the trade-off between…
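
The abstract describes SCAPE only as a statistically calibrated error detector with coverage guarantees, so its exact procedure is not given here. As a rough illustration of what such a guarantee can look like, the sketch below uses generic split-conformal calibration, which is a swapped-in standard technique and not necessarily the authors' method. It assumes a classifier has already assigned each extracted cell an error score (higher means more likely wrong) and that a small calibration set of extractions known to be wrong is available. The function names `calibrate_threshold` and `flag_for_review` and all scores are hypothetical.

```python
import math


def calibrate_threshold(error_scores, alpha):
    """Pick a score threshold from calibration extractions known to be wrong.

    Assumption: the scores are exchangeable with the scores of future wrong
    extractions. Under that assumption, a new wrong extraction scores at or
    above the returned threshold with probability at least 1 - alpha.
    """
    n = len(error_scores)
    # Finite-sample conformal rank: the k-th smallest calibration score.
    k = math.floor(alpha * (n + 1))
    if k < 1:
        # Too few calibration points for this alpha: flag everything.
        return float("-inf")
    return sorted(error_scores)[k - 1]


def flag_for_review(scores, threshold):
    """Mark extractions whose error score reaches the calibrated threshold."""
    return [s >= threshold for s in scores]


# Hypothetical error scores for seven calibration extractions known to be wrong.
calibration = [0.9, 0.8, 0.75, 0.6, 0.95, 0.4, 0.85]
threshold = calibrate_threshold(calibration, alpha=0.25)
flags = flag_for_review([0.3, 0.6, 0.99], threshold)
```

Lowering `alpha` lowers the threshold, so more extractions are flagged for correction. That is one simple way an accuracy-cost trade-off can arise, though how ReDD and SCAPE-HYB handle it is not stated in the truncated abstract.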

Taxonomy

Topics: Data Quality and Management · Advanced Database Systems and Queries · Web Data Mining and Analysis