Language-integrated provenance by trace analysis

Stefan Fehrenbach; James Cheney

arXiv:1905.02051·cs.PL·May 7, 2019

Language-integrated provenance by trace analysis

Stefan Fehrenbach, James Cheney

PDF

TL;DR

This paper introduces a novel self-tracing transformation and trace analysis method that enables flexible, language-integrated provenance tracking within a core language extension for Links, enhancing query result explainability.

Contribution

It presents a new self-tracing transformation and trace analysis framework integrated into LinksT, allowing customizable provenance definitions as user code.

Findings

01

Demonstrates the design and implementation of the approach.

02

Shows examples of provenance capabilities.

03

Outlines key correctness properties.

Abstract

Language-integrated provenance builds on language-integrated query techniques to make provenance information explaining query results readily available to programmers. In previous work we have explored language-integrated approaches to provenance in Links and Haskell. However, implementing a new form of provenance in a language-integrated way is still a major challenge. We propose a self-tracing transformation and trace analysis features that, together with existing techniques for type-directed generic programming, make it possible to define different forms of provenance as user code. We present our design as an extension to a core language for Links called LinksT, give examples showing its capabilities, and outline its metatheory and key correctness properties.

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

\newunicodechar

⟦ $\llbracket$ \newunicodechar⟧ $\rrbracket$ \newunicodechar↦↦ \newunicodechar⋅ $\cdot$ \newunicodechar𝔇 $\mathfrak{D}$ \newunicodechar𝔏 $\mathfrak{L}$ \newunicodechar𝔚 $\mathfrak{W}$ \newunicodecharℰ $\mathcal{E}$ \newunicodecharΓ $\Gamma$ \newunicodechar⊢ $\vdash$ \newunicodecharτ $\tau$ \newunicodechar⟶ $\longrightarrow$ \newunicodechar∧ $\wedge$ \newunicodechar∀ $\forall$ \newunicodecharλ $\lambda$ \newunicodecharα $\alpha$ \newunicodechar✓✓ \newunicodechar⤳ $\leadsto$

Language-integrated provenance by trace analysis

Stefan Fehrenbach

University of EdinburghUnited Kingdom

[email protected]

and

James Cheney

University of Edinburgh and The Alan Turing InstituteUnited Kingdom

[email protected]

(2019)

Abstract.

Language-integrated provenance builds on language-integrated query techniques to make provenance information explaining query results readily available to programmers. In previous work we have explored language-integrated approaches to provenance in Links and Haskell. However, implementing a new form of provenance in a language-integrated way is still a major challenge. We propose a self-tracing transformation and trace analysis features that, together with existing techniques for type-directed generic programming, make it possible to define different forms of provenance as user code. We present our design as an extension to a core language for Links called LinksT, give examples showing its capabilities, and outline its metatheory and key correctness properties.

language-integrated provenance, language-integrated query, query normalization, provenance

††copyright: acmlicensed††price: 15.00††doi: 10.1145/3315507.3330198††journalyear: 2019††isbn: 978-1-4503-6718-9/19/06††conference: Proceedings of the 17th ACM SIGPLAN International Symposium on Database Programming Languages; June 23, 2019; Phoenix, AZ, USA††booktitle: Proceedings of the 17th ACM SIGPLAN International Symposium on Database Programming Languages (DBPL ’19), June 23, 2019, Phoenix, AZ, USA††ccs: Information systems Data provenance††ccs: Software and its engineering Functional languages

1. Introduction

Provenance tracking has been heavily investigated as a means of making database query results explainable (ICDT2001BunemanKT; FTDB2009CheneyCT), for example to explain where in the input some output data came from (where-provenance) or what input records justify the presence of some output record (lineage, why-provenance). Many prototype implementations of provenance-tracking have been developed as ad hoc extensions to (or middleware layers wrapping) ordinary relational database systems (Senellart:2018:PPP:3229863.3275590; GProMArab2014), typically by augmenting the data model with additional annotations and propagating them through the query using an enriched semantics. This approach, however, inhibits reuse and uptake of these techniques since a special (and usually not maintained) variant of the database system must be used. Installing, maintaining and using such research prototypes is not for the faint of heart.

We advocate a language-based approach to provenance, building on language-integrated query (meijer06sigmod; TLDI2012LindleyC; SIGMOD2014CheneyLW). In language-integrated query, database queries are embedded in a programming language as first-class citizens, not uninterpreted strings, and thus benefit from typechecking and other language services. In language-integrated provenance, we aim to support provenance-tracking techniques by modifying the behavior of queries at the language level to track their own provenance. These modified queries can then be used with unmodified, mainstream database systems. To date, Fehrenbach and Cheney (FEHRENBACH2018103) have demonstrated the capabilities of language-integrated provenance in Links, a Web and database programming language, and Stolarek and Cheney (Programming2018StolarekC) adapted this approach to work with Dsh, an existing language-integrated query library in Haskell (SIGMOD2015UlrichG). In both cases, where-provenance and lineage are supported as representative forms of provenance.

However, both approaches explored so far have drawbacks. Our previous implementations of language-integrated provenance in Links are ad hoc language extensions, requiring nontrivial changes to the Links front-end and runtime. It is not obvious how to support both extensions at once, and supporting additional extensions would likewise require a major intervention to the language. In Dsh, we were able to support both forms of provenance at once, but did need to make superficial changes to Dsh and carry out nontrivial type-level programming to make our translations pass Haskell’s typechecker. Thus, in both cases, we feel there is significant room for improvement, to make it easier to develop new forms of provenance without ad hoc language extensions or subtle type-level programming.

In this paper, we present a core language design called LinksT that extends the query language core of Links (a variant of the Nested Relational Calculus (TCS1995BunemanNTW)) with several powerful programming constructs. These include well-studied constructs for type-directed generic programming (e.g. Typerec and typecase) (Harper:1995:CPU:199448.199475), extended to support generic programming with record types (PLDI2010Chlipala). In addition, we propose novel primitives for constructing and analyzing query traces (following (cheney14ppdp)). We will show that these features suffice to define forms of provenance programmatically, using the following recipe. Given a query $q$ , we first transform it to a self-tracing query $q^{\mathsf{T}}$ . We can then compose $q^{\mathsf{T}}$ with a trace analysis function $f^{\mathsf{P}}$ , which is simply an ordinary LinksT function that makes use of the type and trace analysis capabilities. Each form of provenance we support can be defined as a trace analysis function, and can be applied to queries of any type. Thus, $f^{\mathsf{P}}\circ q^{\mathsf{T}}$ defines the intended query result together with the desired provenance. Finally, we normalize $f^{\mathsf{P}}\circ q^{\mathsf{T}}$ to a Nrc expression, which can be further translated to Sql and evaluated efficiently on a mainstream database by the existing language-integrated query implementation in Links (SIGMOD2014CheneyLW). Normalization effectively deforests the traces that would be produced by $q^{\mathsf{T}}$ if we were to execute it directly; thus, executing the normalized Nrc query is typically much faster than executing $q^{\mathsf{T}}$ and then $f^{\mathsf{P}}$ separately would be.

Our main contributions are as follows.

•

We show via examples (Section LABEL:sec:trace-analysis) how a programmer can use type and trace analysis constructs to define different modifications of query behavior, for example to extract where-provenance and lineage from traces.

•

We present the language design of LinksT. We informally introduce the novel trace constructors in Section 3 and present syntax and type system details in Section LABEL:sec:syntax-semantics. This includes traces and trace analysis operations, and reviews the already-studied type-directed generic programming features from previous work.

•

We then present the self-tracing transformation (Section LABEL:sec:self-tracing) and the extended rewrite rule system needed for normalization, and outline the proofs of type preservation and correctness for these components (Section LABEL:sec:normalization).

We have a preliminary implementation, but the main contributions of this paper concern the design and theory, and a full-scale implementation in Links is future work.

2. The problem

As explained earlier, in previous work we have investigated different ways of implementing where-provenance and lineage on top of existing language-integrated query systems, namely Links and Dsh. In both cases, given a query $q$ , we wish to construct another query $q^{\mathsf{P}}$ that provides both the ordinary query results of $q$ and additional annotations that provide some form of information about how query results relate to the input data. Preferably, the transformed query should still be in the same query language as that handled by the existing language-integrated query system, so that this implementation can be reused to generate efficient Sql queries. Of course, in a typed programming language, we also expect the generated query to be well-typed.

For example, for where-provenance, we wish to construct query $q^{\mathsf{where}}$ in which each data field in the query result is annotated with a source location in the input database, which we typically implement as a tuple $(R,A,i)$ consisting of a relation name $R$ , attribute name $A$ , and row identifier (or primary key value) $i$ . Likewise, for lineage, we wish to construct a query $q^{\mathsf{lineage}}$ in which each output record is annotated with a collection of references $(R,i)$ to input records that help “witness” or “justify” the presence of the output record.

As a running example, consider the following boat tours query (in Links syntax). It uses nested \lst@ifdisplaystylefor comprehensions to iterate over two tables, filtering by type and joining on the name columns. It returns a list of records (pairs of field name and value separated by commas and enclosed in angle brackets) containing the agencies names and phone numbers. See Figure 1 for an example input database and result.

⬇

for (e <- externalTours) where (e.type == ”boat”)

for (a <- agencies) where (a.name == e.name)

[{name = e.name, phone = a.phone}]

\lst@ifdisplaystyle

The where-provenance translation of this query should annotate the field value Burns’s in the result with where-provenance annotation (ExternalTours, name, 7), and the lineage translation should annotate the row (Burns’s, 607 3000) with lineage annotation [(Agencies,2), (ExternalTours,7)]. (Note that in lineage, the annotation of each row is a collection of input row references; both Links and Dsh can already handle such nested query results (SIGMOD2014CheneyLW; SIGMOD2015UlrichG).)

In our previous work, we have implemented these translations either by directly changing the language implementation (in Links), or by making nontrivial modifications to a language-integrated query library (in Dsh). While this work shows that it is possible to provide (reasonably efficient) language-integrated provenance via source-to-source translation of queries, both approaches are still nontrivial interventions to an existing implementation, and so developing new forms of provenance, or variations on existing ones, is still a considerable challenge.

If we wish to provide the necessary query transformation capability using high-level programming constructs, then we face two significant challenges. First, transforming the query expression in the direct approaches considered so far relies on fairly heavyweight metaprogramming capabilities, and type-safe metaprogramming by reflection over object languages with binding constructs (such as comprehensions in queries) is a significant challenge. Based on prior work on general forms of provenance such as traces (10.1007/978-3-642-28641-4_22; cheney14ppdp) or provenance polynomials (green07pods), we might hope to avoid the need for heavyweight metaprogramming by computing a single, general form of query trace once and for all, and specializing it to different forms of provenance later. However, this raises the question of how to design a suitable tracing framework and how to provide appropriate language constructs that can specialize traces to different forms of provenance, in a type-safe and efficient way. (In particular, we cannot simply reuse the provenance polynomials/semirings framework since it is not able to capture where-provenance (FTDB2009CheneyCT).)

Second, and related to the previous point, we need to change not only the query behavior but also the query result type. Specifically, in the type of $q^{\mathsf{where}}$ , each field is replaced with a record consisting of the ordinary data value and its where-provenance annotation, whereas in $q^{\mathsf{lineage}}$ , each element of a collection in the query result type becomes a pair consisting of the original data and a collection of input row references. In previous implementations, we have added this behavior to the typechecker directly (in Links), or (in Dsh) used type families (ChaKelPey05) to define the effect of the where-provenance or lineage transformations at the type level. In the case of Dsh, this necessitated subtle changes to the Dsh library, as well as defining evidence translations at the type and term levels to convince Haskell’s typechecker that our definitions were type-correct.

Thus, in both Links and Dsh, our previous work has shown that it is possible to implement language-integrated provenance, but the need to manipulate both query expressions and their types makes this more difficult than we might hope. Our goal, therefore, is to identify a small set of language features that addresses all of the above needs well: we would like to be able to customize the query behavior to handle multiple forms of provenance, while retaining the existing benefits demonstrated by previous implementations of language-integrated provenance: specifically type-safety and efficient query generation.

3. Query traces

In this section we describe what our traces look like through a series of examples. We show how to rewrite expressions to compute their own trace in Section LABEL:sec:self-tracing. As described earlier, the intent is to compose a trace analysis function with a self-tracing query and normalize to deforest the trace and only compute the parts that we actually need.

The \lst@ifdisplaystyletrace keyword causes a query expression to be traced. For example, \lst@ifdisplaystyletrace 2+3 has type \lst@ifdisplaystyleTrace Int and evaluates to \lst@ifdisplaystyleOpPlus{l=Lit 2,r=Lit 3}. Here, \lst@ifdisplaystyleOpPlus represents an addition operation and its argument is a record of the left and right subtraces, and \lst@ifdisplaystyleLit is the constructor for traces of literal values. Traces of records are just records of traces, and traces of lists are just lists of traces, e.g., tracing the singleton list of the singleton record \lst@ifdisplaystyle[{answer=42}] results in \lst@ifdisplaystyle[{answer=Lit 42}].

In general, the trace of an expression with type $A$ has a type where every base type is replaced by the traced version of the base type, but all list and record constructors stay the same. We can express this in LinksT directly as the type-level function \lst@ifdisplaystyleTRACE defined in Figure 2. We capitalize type-level entities (except variables) and trace constructors, and write type-level functions in all uppercase. Typerec folds over a type, in this case the type variable \lst@ifdisplaystylea. It uses its first three arguments for base types (in our case replacing Boolwith \lst@ifdisplaystyleTrace Bool, etc.). The next argument is used if the argument is a list type and applied to the original element type and the recursively transformed element type. The next arguments work similarly for records and trace types.

Tables are typed as lists of records. Their traces reveal that they are not constants in the query however. Values originating from tables are marked with the \lst@ifdisplaystyleCell constructor. For example, the trace of the agencies table looks like this:

\lst@ifdisplaystyle