To not miss the forest for the trees -- a holistic approach for explaining missing answers over nested data (extended version)
Ralf Diestelkaemper, Seokki Lee, Melanie Herschel, Boris Glavic

TL;DR
This paper introduces a scalable approach for explaining missing answers to queries over nested data. It considers schema-modifying operators as potential causes and demonstrates its effectiveness on large datasets in Spark.
Contribution
The approach is the first to support nested data and schema-modifying operators in query-based explanations for missing answers, and its heuristic algorithm scales to large datasets.
Findings
Supports nested data and schema modifications
Scales efficiently to large datasets in Spark
Identifies explanations missed by existing techniques
Abstract
Query-based explanations for missing answers identify which operators of a query are responsible for the failure to return a missing answer of interest. This type of explanation has proven useful in a variety of contexts, including the debugging of complex analytical queries. Such queries are frequent in big data systems such as Apache Spark. We present a novel approach for producing query-based explanations. Our approach is the first to support nested data and to consider operators that modify the schema and structure of the data (e.g., nesting and projections) as potential causes of missing answers. To efficiently compute explanations, we propose a heuristic algorithm that applies two novel techniques: (i) reasoning about multiple schema alternatives for a query and (ii) re-validating at each step whether an intermediate result can contribute to the missing answer. Using an…
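To make the two techniques named in the abstract more concrete, the following is a minimal toy sketch in plain Python, not the authors' algorithm and not Spark code. It assumes a hypothetical three-operator query (flatten, select, project) over a small nested dataset, and every name in it (`PEOPLE`, `flatten`, `compatible`, `first_failing_operator`, `explain`) is invented for illustration. It traces the missing answer through the query once per schema alternative and, after each operator, re-checks whether any intermediate row could still produce that answer.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, object]
Operator = Tuple[str, Callable[[List[Row]], List[Row]]]

# Hypothetical nested input: each person carries two nested address lists.
PEOPLE: List[Row] = [
    {"name": "Sue",
     "address1": [{"city": "NY", "year": 2019}],
     "address2": [{"city": "LA", "year": 2019}]},
    {"name": "Peter",
     "address1": [{"city": "LA", "year": 2010}],
     "address2": [{"city": "NY", "year": 2018}]},
]


def flatten(rows: List[Row], attr: str) -> List[Row]:
    """Unnest the list stored under `attr`: one output row per nested element."""
    return [{"name": r["name"], **elem} for r in rows for elem in r[attr]]


def select_recent(rows: List[Row]) -> List[Row]:
    """Keep only rows whose year is 2019 or later."""
    return [r for r in rows if r["year"] >= 2019]


def project_city(rows: List[Row]) -> List[Row]:
    """Keep only the city attribute."""
    return [{"city": r["city"]} for r in rows]


def build_pipeline(attr: str) -> List[Operator]:
    """The toy query, parameterized by which nested attribute gets flattened."""
    return [
        (f"flatten({attr})", lambda rows: flatten(rows, attr)),
        ("select(year >= 2019)", select_recent),
        ("project(city)", project_city),
    ]


def compatible(row: Row, missing: Row) -> bool:
    """A row is compatible if none of its attributes contradicts the missing answer."""
    return all(row[k] == v for k, v in missing.items() if k in row)


def first_failing_operator(attr: str, missing: Row) -> Optional[str]:
    """Return the first operator after which no compatible row survives, or None."""
    rows = PEOPLE
    for name, op in build_pipeline(attr):
        rows = op(rows)
        # Re-validate after every step instead of only inspecting the final result.
        if not any(compatible(r, missing) for r in rows):
            return name
    return None


def explain(missing: Row, original: str, alternatives: List[str]) -> Dict[str, Optional[str]]:
    """Run the trace once for the original schema choice and once per alternative."""
    return {a: first_failing_operator(a, missing) for a in [original, *alternatives]}


explanations = explain({"city": "NY"}, "address2", ["address1"])
print(explanations)
```

In this toy setting, the query as written over `address2` loses the last row compatible with the missing answer at the selection, while flattening `address1` instead lets the answer through without touching the selection. A trace that considers only the original schema would report just the selection, which is the kind of gap that reasoning about schema alternatives is meant to close.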
Taxonomy
Topics: Semantic Web and Ontologies · Advanced Database Systems and Queries · Data Quality and Management
