Mixing set and bag semantics

Wilmer Ricciotti; James Cheney

arXiv:1905.02069·cs.DB·May 7, 2019

Mixing set and bag semantics

Wilmer Ricciotti, James Cheney

PDF

TL;DR

This paper explores the formal foundations of queries that combine set and bag semantics, introducing a calculus, rewrite rules, and translation considerations, highlighting open questions about their expressiveness.

Contribution

It formalizes the problem of mixed set and bag query semantics, proposes a calculus with rewrite rules, and characterizes queries translatable to SQL.

Findings

01

Introduces a calculus with set and multiset types

02

Provides rewrite rules for query normalization

03

Characterizes queries translatable to SQL

Abstract

The conservativity theorem for nested relational calculus implies that query expressions can freely use nesting and unnesting, yet as long as the query result type is a flat relation, these capabilities do not lead to an increase in expressiveness over flat relational queries. Moreover, Wong showed how such queries can be translated to SQL via a constructive rewriting algorithm. While this result holds for queries over either set or multiset semantics, to the best of our knowledge, the questions of conservativity and normalization have not been studied for queries that mix set and bag collections, or provide duplicate-elimination operations such as SQL's $SELECT DISTINCT$ . In this paper we formalize the problem, and present partial progress: specifically, we introduce a calculus with both set and multiset collection types, along with natural mappings from sets to bags…

Equations33

\begin{array}[]{rcl}S,T&::=&A\mathbin{~{}|~{}}\langle{\overrightarrow{\ell:T}}\rangle\mathbin{~{}|~{}}\{{T}\}\mathbin{~{}|~{}}\Lbag{T}\Rbag\\ L,M,N&::=&x\mathbin{~{}|~{}}c(\overrightarrow{M})\mathbin{~{}|~{}}\langle{\overrightarrow{\ell=M}}\rangle\mathbin{~{}|~{}}M.\ell\\ &\mathbin{~{}|~{}}&\mathrm{where}_{\mathsf{set}}~{}M~{}\mathrm{do}~{}N\mathbin{~{}|~{}}\mathrm{where}_{\mathsf{bag}}~{}M~{}\mathrm{do}~{}N\\ &\mathbin{~{}|~{}}&\emptyset\mathbin{~{}|~{}}\{{M}\}\mathbin{~{}|~{}}M\cup N\mathbin{~{}|~{}}\bigcup\{{M|x\leftarrow N}\}\\ &\mathbin{~{}|~{}}&\Lbag\Rbag\mathbin{~{}|~{}}\Lbag{M}\Rbag\mathbin{~{}|~{}}M\uplus N\mathbin{~{}|~{}}\biguplus\Lbag{M|x\leftarrow N}\Rbag\\ &\mathbin{~{}|~{}}&\delta M\mathbin{~{}|~{}}\iota M\end{array}

\begin{array}[]{rcl}S,T&::=&A\mathbin{~{}|~{}}\langle{\overrightarrow{\ell:T}}\rangle\mathbin{~{}|~{}}\{{T}\}\mathbin{~{}|~{}}\Lbag{T}\Rbag\\ L,M,N&::=&x\mathbin{~{}|~{}}c(\overrightarrow{M})\mathbin{~{}|~{}}\langle{\overrightarrow{\ell=M}}\rangle\mathbin{~{}|~{}}M.\ell\\ &\mathbin{~{}|~{}}&\mathrm{where}_{\mathsf{set}}~{}M~{}\mathrm{do}~{}N\mathbin{~{}|~{}}\mathrm{where}_{\mathsf{bag}}~{}M~{}\mathrm{do}~{}N\\ &\mathbin{~{}|~{}}&\emptyset\mathbin{~{}|~{}}\{{M}\}\mathbin{~{}|~{}}M\cup N\mathbin{~{}|~{}}\bigcup\{{M|x\leftarrow N}\}\\ &\mathbin{~{}|~{}}&\Lbag\Rbag\mathbin{~{}|~{}}\Lbag{M}\Rbag\mathbin{~{}|~{}}M\uplus N\mathbin{~{}|~{}}\biguplus\Lbag{M|x\leftarrow N}\Rbag\\ &\mathbin{~{}|~{}}&\delta M\mathbin{~{}|~{}}\iota M\end{array}

⋃ {M ∣ x \leftarrow N, y \leftarrow R} := ⋃ {⋃ {M ∣ y \leftarrow R} ∣ x \leftarrow N}

⋃ {M ∣ x \leftarrow N, y \leftarrow R} := ⋃ {⋃ {M ∣ y \leftarrow R} ∣ x \leftarrow N}

M \cup N

M \cup N

⋃ {M ∣ x \leftarrow N}

\displaystyle\left\llbracket{T}\right\rrbracket=x_{T}

\displaystyle\left\llbracket{T}\right\rrbracket=x_{T}

\displaystyle\left\llbracket{{\color[rgb]{0,0,1}\mathtt{SELECT}}~{}{\overrightarrow{t~{}{\color[rgb]{0,0,1}\mathtt{AS}}~{}\ell}}~{}{\color[rgb]{0,0,1}\mathtt{FROM}}~{}{\overrightarrow{Q~{}{\color[rgb]{0,0,1}\mathtt{AS}}~{}y}}~{}{\color[rgb]{0,0,1}\mathtt{WHERE}}~{}{B}}\right\rrbracket

\displaystyle\qquad=\biguplus\Lbag{\mathrm{where}_{\mathsf{bag}}~{}\left\llbracket{B}\right\rrbracket~{}\Lbag{\langle{\overrightarrow{\ell=\left\llbracket{t}\right\rrbracket}}\rangle}\Rbag|\overrightarrow{x\leftarrow\left\llbracket{Q}\right\rrbracket}}\Rbag

\displaystyle\left\llbracket{{\color[rgb]{0,0,1}\mathtt{SELECT}}~{}{\color[rgb]{0,0,1}\mathtt{DISTINCT}}~{}{\overrightarrow{t~{}{\color[rgb]{0,0,1}\mathtt{AS}}~{}\ell}}~{}{\color[rgb]{0,0,1}\mathtt{FROM}}~{}{\overrightarrow{Q~{}{\color[rgb]{0,0,1}\mathtt{AS}}~{}x}}~{}{\color[rgb]{0,0,1}\mathtt{WHERE}}~{}{B}}\right\rrbracket

\displaystyle\qquad=\iota\delta\left\llbracket{{\color[rgb]{0,0,1}\mathtt{SELECT}}~{}{\overrightarrow{t~{}{\color[rgb]{0,0,1}\mathtt{AS}}~{}\ell}}~{}{\color[rgb]{0,0,1}\mathtt{FROM}}~{}{\overrightarrow{T~{}{\color[rgb]{0,0,1}\mathtt{AS}}~{}x}}~{}{\color[rgb]{0,0,1}\mathtt{WHERE}}~{}{B}}\right\rrbracket

\displaystyle\left\llbracket{Q_{1}~{}{\color[rgb]{0,0,1}\mathtt{UNION}}~{}{\color[rgb]{0,0,1}\mathtt{ALL}}~{}Q_{2}}\right\rrbracket=\left\llbracket{Q_{1}}\right\rrbracket\uplus\left\llbracket{Q_{2}}\right\rrbracket

\displaystyle\left\llbracket{Q_{1}~{}{\color[rgb]{0,0,1}\mathtt{UNION}}~{}Q_{2}}\right\rrbracket=\iota\delta\left\llbracket{Q_{1}~{}{\color[rgb]{0,0,1}\mathtt{UNION}}~{}{\color[rgb]{0,0,1}\mathtt{ALL}}~{}Q_{2}}\right\rrbracket

⨄ \Lbag M ∣ x \leftarrow ι ⋃ {{N} ∣ y \leftarrow P} \Rbag ⇝ ???

⨄ \Lbag M ∣ x \leftarrow ι ⋃ {{N} ∣ y \leftarrow P} \Rbag ⇝ ???

⨄ \Lbag J ∣ x \leftarrow t, y \leftarrow ι P \Rbag

⨄ \Lbag J ∣ x \leftarrow t, y \leftarrow ι P \Rbag

⨄ \Lbag J ∣ x \leftarrow t, y \leftarrow ι (P \cup P^{'}) \Rbag

⨄ \Lbag J ∣ x \leftarrow t, y \leftarrow ι (P \cup P^{'}) \Rbag

{\color[rgb]{0,0,1}\mathtt{SELECT}}~{}q~{}{\color[rgb]{0,0,1}\mathtt{FROM}}~{}t~{}{\color[rgb]{0,0,1}\mathtt{AS}}~{}x,(P~{}{\color[rgb]{0,0,1}\mathtt{UNION}}~{}P^{\prime})~{}{\color[rgb]{0,0,1}\mathtt{AS}}~{}y

{\color[rgb]{0,0,1}\mathtt{SELECT}}~{}q~{}{\color[rgb]{0,0,1}\mathtt{FROM}}~{}t~{}{\color[rgb]{0,0,1}\mathtt{AS}}~{}x,(P~{}{\color[rgb]{0,0,1}\mathtt{UNION}}~{}P^{\prime})~{}{\color[rgb]{0,0,1}\mathtt{AS}}~{}y

⋃ {M ∣ x \leftarrow δ ⨄ \Lbag \Lbag N \Rbag ∣ y \leftarrow P \Rbag} ⇝ ⋃ {M [N / x] ∣ y \leftarrow δ P}

⋃ {M ∣ x \leftarrow δ ⨄ \Lbag \Lbag N \Rbag ∣ y \leftarrow P \Rbag} ⇝ ⋃ {M [N / x] ∣ y \leftarrow δ P}

\begin{array}[]{l}\iota\bigcup\{{\langle{\mathit{Id}=f.\mathit{Id}}\rangle|f\leftarrow\mathit{FoodEvents}}\}\\ \qquad\uplus\iota\bigcup\{{\langle{\mathit{Id}=b.\mathit{Id}}\rangle|b\leftarrow\mathit{BookEvents}}\}\end{array}

\begin{array}[]{l}\iota\bigcup\{{\langle{\mathit{Id}=f.\mathit{Id}}\rangle|f\leftarrow\mathit{FoodEvents}}\}\\ \qquad\uplus\iota\bigcup\{{\langle{\mathit{Id}=b.\mathit{Id}}\rangle|b\leftarrow\mathit{BookEvents}}\}\end{array}

\begin{array}[]{l}({\color[rgb]{0,0,1}\mathtt{SELECT}}~{}{\color[rgb]{0,0,1}\mathtt{DISTINCT}}~{}f.\mathit{Id}~{}{\color[rgb]{0,0,1}\mathtt{FROM}}~{}{\mathit{FoodEvents}~{}{\color[rgb]{0,0,1}\mathtt{AS}}~{}f})~{}{\color[rgb]{0,0,1}\mathtt{UNION}}~{}{\color[rgb]{0,0,1}\mathtt{ALL}}\\ ({\color[rgb]{0,0,1}\mathtt{SELECT}}~{}{\color[rgb]{0,0,1}\mathtt{DISTINCT}}~{}b.\mathit{Id}~{}{\color[rgb]{0,0,1}\mathtt{FROM}}~{}{\mathit{BookEvents}~{}{\color[rgb]{0,0,1}\mathtt{AS}}~{}b})\end{array}

\begin{array}[]{l}({\color[rgb]{0,0,1}\mathtt{SELECT}}~{}{\color[rgb]{0,0,1}\mathtt{DISTINCT}}~{}f.\mathit{Id}~{}{\color[rgb]{0,0,1}\mathtt{FROM}}~{}{\mathit{FoodEvents}~{}{\color[rgb]{0,0,1}\mathtt{AS}}~{}f})~{}{\color[rgb]{0,0,1}\mathtt{UNION}}~{}{\color[rgb]{0,0,1}\mathtt{ALL}}\\ ({\color[rgb]{0,0,1}\mathtt{SELECT}}~{}{\color[rgb]{0,0,1}\mathtt{DISTINCT}}~{}b.\mathit{Id}~{}{\color[rgb]{0,0,1}\mathtt{FROM}}~{}{\mathit{BookEvents}~{}{\color[rgb]{0,0,1}\mathtt{AS}}~{}b})\end{array}

\begin{array}[]{l}\biguplus\Lbag\mathrm{where}_{\mathsf{bag}}~{}(x.A=y.A)~{}\Lbag{\langle{B=x.B,C=y.C}\rangle}\Rbag\\ \quad|x\leftarrow T,y\leftarrow\iota(\delta U\cup\delta V)\Rbag\end{array}

\begin{array}[]{l}\biguplus\Lbag\mathrm{where}_{\mathsf{bag}}~{}(x.A=y.A)~{}\Lbag{\langle{B=x.B,C=y.C}\rangle}\Rbag\\ \quad|x\leftarrow T,y\leftarrow\iota(\delta U\cup\delta V)\Rbag\end{array}

\begin{array}[]{l}{\color[rgb]{0,0,1}\mathtt{SELECT}}~{}x.B,y.C~{}{\color[rgb]{0,0,1}\mathtt{FROM}}~{}T~{}{\color[rgb]{0,0,1}\mathtt{AS}}~{}x,(U~{}{\color[rgb]{0,0,1}\mathtt{UNION}}~{}V)~{}{\color[rgb]{0,0,1}\mathtt{AS}}~{}y\\ {\color[rgb]{0,0,1}\mathtt{WHERE}}~{}x.A=y.A\end{array}

\begin{array}[]{l}{\color[rgb]{0,0,1}\mathtt{SELECT}}~{}x.B,y.C~{}{\color[rgb]{0,0,1}\mathtt{FROM}}~{}T~{}{\color[rgb]{0,0,1}\mathtt{AS}}~{}x,(U~{}{\color[rgb]{0,0,1}\mathtt{UNION}}~{}V)~{}{\color[rgb]{0,0,1}\mathtt{AS}}~{}y\\ {\color[rgb]{0,0,1}\mathtt{WHERE}}~{}x.A=y.A\end{array}

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Mixing set and bag semantics

Wilmer Ricciotti

LFCS, School of InformaticsUniversity of Edinburgh

[email protected]

and

James Cheney

LFCS, School of InformaticsUniversity of Edinburgh and The Alan Turing Institute

[email protected]

(2019)

Abstract.

The conservativity theorem for nested relational calculus implies that query expressions can freely use nesting and unnesting, yet as long as the query result type is a flat relation, these capabilities do not lead to an increase in expressiveness over flat relational queries. Moreover, Wong showed how such queries can be translated to SQL via a constructive rewriting algorithm. While this result holds for queries over either set or multiset semantics, to the best of our knowledge, the questions of conservativity and normalization have not been studied for queries that mix set and bag collections, or provide duplicate-elimination operations such as SQL’s ${\color[rgb]{0,0,1}\mathtt{SELECT}}~{}{\color[rgb]{0,0,1}\mathtt{DISTINCT}}$ . In this paper we formalize the problem, and present partial progress: specifically, we introduce a calculus with both set and multiset collection types, along with natural mappings from sets to bags and vice versa, present a set of valid rewrite rules for normalizing such queries, and give an inductive characterization of a set of queries whose normal forms can be translated to SQL. We also consider examples that do not appear straightforward to translate to SQL, illustrating that the relative expressiveness of flat and nested queries with mixed set and multiset semantics remains an open question.

language-integrated query, query normalization

††copyright: acmlicensed††price: 15.00††doi: 10.1145/3315507.3330202††journalyear: 2019††isbn: 978-1-4503-6718-9/19/06††conference: Proceedings of the 17th ACM SIGPLAN International Symposium on Database Programming Languages; June 23, 2019; Phoenix, AZ, USA††booktitle: Proceedings of the 17th ACM SIGPLAN International Symposium on Database Programming Languages (DBPL ’19), June 23, 2019, Phoenix, AZ, USA††ccs: Information systems Structured Query Language††ccs: Software and its engineering Functional languages

1. Introduction

The nested relational calculus (BNTW95, ) provides a principled foundation for integrating database queries into programming languages. Wong’s conservativity theorem (wong96jcss, ) generalized the classic flat-flat theorem (ParedaensG92, ) to show that for any nesting depth $d$ , a query expression over flat input tables returning collections of depth at most $d$ can be expressed without constructing intermediate results of nesting depth greater than $d$ . In the special case $d=1$ , this implies the flat-flat theorem, namely that a nested relational query mapping flat tables to flat tables can be expressed equivalently using the flat relational calculus.

In addition, Wong’s proof technique was constructive, and gave an easily-implemented terminating rewriting algorithm for normalizing NRC queries to equivalent flat queries; these normal forms correspond closely to idiomatic SQL queries and translating from the former to the latter is straightforward. The basic approach has been extended in a number of directions, including to allow for (nonrecursive) higher-order functions in queries (Cooper09, ), and to allow for translating queries that return nested results to a bounded number of flat relational queries (cheney14sigmod, ).

Normalization-based techniques are used in language-integrated query systems such as Kleisli (wong:comprehensions, ) and Links (CLWY06, ). Currently, language-integrated query systems such as C# and F# (meijer:sigmod, ) support duplicate elimination via a ${\color[rgb]{0,0,1}\mathtt{DISTINCT}}$ keyword, which is translated to SQL queries in an ad hoc way, and comes with no guarantees regarding completeness or expressiveness as far as we know, whereas Database-Supported Haskell (DSH) (SIGMOD2015UlrichG, ) supports duplicate elimination but gives all operations list semantics and relies on more sophisticated SQL:1999 features to accomplish this. Fegaras and Maier (DBLP:journals/tods/FegarasM00, ) propose optimization rules for a nested object-relational calculus with set and bag constructs but do not consider the problem of conservativity with respect to flat queries.

Wong’s proof of conservativity also has the nice property that it relies on relatively weak properties of collection types. Thus, it applies both to set and multiset semantics; if we consider nested relational queries over sets, then we can translate to SQL queries using ${\color[rgb]{0,0,1}\mathtt{SELECT}}\;{\color[rgb]{0,0,1}\mathtt{DISTINCT}}$ and ${\color[rgb]{0,0,1}\mathtt{UNION}}$ operations that provide set semantics, while if we consider nested multiset queries we can instead generate plain ${\color[rgb]{0,0,1}\mathtt{SELECT}}$ and ${\color[rgb]{0,0,1}\mathtt{UNION}}\;{\color[rgb]{0,0,1}\mathtt{ALL}}$ operations that do not eliminate duplicates.

SQL itself maintains multiset semantics, but provides several operations that locally employ set semantics, such as ${\color[rgb]{0,0,1}\mathtt{SELECT}}\;{\color[rgb]{0,0,1}\mathtt{DISTINCT}}$ and ${\color[rgb]{0,0,1}\mathtt{UNION}}$ . In a database programming context, it seems natural to consider separate collection types for sets and multisets, so that it is clear from the type of a query expression whether the multiplicity matters. The ability to mix set and multiset queries would be beneficial for an accurate implementation of lineage for Links using the technique proposed by Fehrenbach and Cheney (fehrenbach19, ); however, the consequences of this on the expressiveness of the query language, and the conservativity of nested set/multiset queries over flat ones, do not appear to be well understood. This provides concrete motivation for our work.

In this paper we take some first steps towards conservativity and normal form results for mixed set/multiset queries. We introduce $NRC(Set,Bag)$ , a straightforward generalization of the nested relational calculus that contains two collection types (sets and bags), other standard constructs, and mappings from sets to bags and vice versa. The mapping $\iota$ from sets to bags simply coerces a set to a bag with the same elements, all with multiplicity 1. The mapping $\delta$ from bags to sets performs duplicate elimination: the set corresponding to a given bag consists of all elements of the bag with multiplicity $>0$ . We next show that $NRC(Set,Bag)$ can express conjunctive SQL queries with ${\color[rgb]{0,0,1}\mathtt{SELECT}}\;{\color[rgb]{0,0,1}\mathtt{DISTINCT}}$ and ${\color[rgb]{0,0,1}\mathtt{UNION}}$ , illustrating how idiomatic SQL queries can be written in $NRC(Set,Bag)$ .

We then explore the equational rewriting opportunities afforded by $NRC(Set,Bag)$ . We recapitulate the standard rewriting laws of collection types in NRC, which apply both to sets and to bags individually. We also identify natural properties of $\iota$ and $\delta$ , particularly relating them to set and bag operations. The duplicate elimination operation $\delta$ has several convenient properties, because (as shown by Lellahi and Tannen (tannen, )) it is a monad morphism from the multiset to set monads. However, the converse $\iota$ operation has fewer convenient properties. Nevertheless, $\iota$ and $\delta$ do form a Galois connection between the sets and bags over a given type (ordered by the respective inclusion operations). Specifically, this means that $\iota$ calculates in some sense the optimal bag among all those that approximate a given set, and $\delta$ calculates in some sense the optimal set among all those that approximate a given bag. In fact, this Galois connection is a special case called a Galois insertion, which means that it satisfies $\delta\circ\iota=id$ , that is, if we convert a set to a bag and then eliminate duplicates we get back the original set exactly.

We next discuss the normal forms obtained by applying all possible rewrite rules until no more subexpressions are reducible. (We do not formally explore the termination of this system, but conjecture that it is terminating.) We identify normal forms that can be mapped directly to SQL queries, and give examples for which we do not yet know a systematic translation. Nevertheless, we are able to show a weak conservativity result that is of immediate practical interest: suppose we have queries over flat inputs and returning flat results. If we forbid the use of the $\iota$ operation inside bag comprehensions, then the normal form of any query in this sublanguage is straightforward to translate to SQL.

2. Language overview

We define $NRC(Set,Bag)$ as follows:

[TABLE]

Types include atomic types, record types with named fields, sets and bags. Terms include applied constants, conditional expressions, records with named fields, and various collection terms (empty, singleton, union, and comprehension). In this definition, $x$ ranges over variable names, $c$ over constants, and $\ell$ over record field names. Typing rules for collections are largely standard. We will allow ourselves to use sequences of generators in comprehensions, which are syntactic sugar for nested comprehensions, e.g.:

[TABLE]

We assume an intuitive denotational semantics interpreting these expressions as finite sets and bags, satisfying the following valid rules (among others):

[TABLE]

We can immediately observe the following property about the semantics of $\iota$ and $\delta$ :

Proposition 2.1.

For any type $A$ , the operations $\iota$ and $\delta$ form a Galois connection between sets and bags of elements of type $A$ , ordered by subset $\subseteq$ and multiset inclusion $\leq$ orders respectively. That is, $\iota(M)\leq N\iff M\subseteq\delta(N)$ . In addition, $\delta\circ\iota=id$ .

In addition, we can observe the following relationship between the set and multiset operations:

[TABLE]

Together with identities established earlier, this shows that all of the set operations in $NRC(Set,Bag)$ can be simulated by $NRC(Bag)$ plus $\delta$ and $\iota$ . This allows us to translate SQL queries to terms with flat bag type.

2.1. SQL queries in $NRC(Set,Bag)$

We can show that $NRC(Set,Bag)$ is sufficiently powerful to express the SQL fragment including ${\color[rgb]{0,0,1}\mathtt{SELECT}}$ $[{\color[rgb]{0,0,1}\mathtt{DISTINCT}}]$ - ${\color[rgb]{0,0,1}\mathtt{FROM}}$ - ${\color[rgb]{0,0,1}\mathtt{WHERE}}$ clauses and ${\color[rgb]{0,0,1}\mathtt{UNION}}\;[{\color[rgb]{0,0,1}\mathtt{ALL}}]$ . Our translation assumes that table names $T$ are interpreted as free variables $x_{T}$ of a suitable bag type; we do not give an explicit translation of SQL terms and conditional expressions, but it is easy to express them as combinations of record field projections and $NRC$ constants. Also notice that our translation assumes that all terms in the ${\color[rgb]{0,0,1}\mathtt{SELECT}}$ clause and all subqueries in the ${\color[rgb]{0,0,1}\mathtt{FROM}}$ clause have been explicitly named using the ${\color[rgb]{0,0,1}\mathtt{AS}}$ keyword.

[TABLE]

${\color[rgb]{0,0,1}\mathtt{SELECT}}\;\star$ queries can also be expressed in $NRC(Set,Bag)$ by desugaring them to named ${\color[rgb]{0,0,1}\mathtt{SELECT}}$ queries.

2.2. Normalization

The translation of $NRC(Set,Bag)$ into SQL relies on the normalization of queries into an SQL-like fragment of the formalism by means of a set of rewrite rules: Fig. 1 shows a selection of the rules (standard rules for set and bag queries are in an appendix). Most of the rules are standard for set and bag queries respectively. Based on the fact that the rewrite rules for set and bag queries, when considered separately, are known to be strongly normalizing and preserve the meaning of expressions, and given that the rules for $\delta/\iota$ (where the mixing of sets and bags occurs) do not seem to be problematic, we believe our system to be terminating and to preserve the meaning of expressions; we do not know whether it enjoys confluence, but this property is not required (i.e. we do not require unique normal forms).

Fortuitously, $\delta M$ subterms can usually be simplified, and do not block other rules. On the other hand, $\iota M$ subterms can block other rewrite rules. This causes two problems. First, even if the result type of a query is flat, it might introduce nested structures internally. For homogeneous set or bag queries, these nested structures can be normalized away, but in mixed set–multiset queries, $\iota M$ can block rewrite rules needed to unnest a nested set-valued subquery $M$ . We therefore make a simplifying assumption that $\iota$ and $\delta$ are applied only to flat collections (sets or multisets of flat records) to avoid this complication.

Secondly, even with this constraint imposed, the normal form for bag-queries still allows set-queries $\iota P$ in several positions. In particular, it is unclear how to unnest set comprehensions within bag comprehensions:

[TABLE]

The normal form for bag-queries must therefore allow normalized set-queries $\iota P$ in several positions, particularly in comprehension generators $G$ . This implies that in a normalized term such as

[TABLE]

$x$ can actually appear free inside $P$ and be captured by the first generator. SQL disallows such dependencies between queries in the same ${\color[rgb]{0,0,1}\mathtt{FROM}}$ clause. For example, in the query

[TABLE]

it could be that $z$ appears in $P\cup P^{\prime}$ , but the analogous query

[TABLE]

is not valid SQL if $z$ appears in $P~{}{\color[rgb]{0,0,1}\mathtt{UNION}}~{}P^{\prime}$ .

The target fragment of NRC for flat queries with type $\{{\langle{\overrightarrow{\ell:T}}\rangle}\}$ is defined by the following grammar:

[TABLE]

By a similar reasoning, for multiset queries we can obtain normal forms described by the following grammar:

[TABLE]

Discussion

The normal forms $P$ of set queries can be directly translated to equivalent SQL, replacing $\cup$ with ${\color[rgb]{0,0,1}\mathtt{UNION}}$ , comprehensions and $\delta x$ (where $x$ is a table variable) with ${\color[rgb]{0,0,1}\mathtt{SELECT}}~{}{\color[rgb]{0,0,1}\mathtt{DISTINCT}}$ , and translating NRC record syntax to SQL style. We can see that unnesting of bag comprehension enclosed in a $\delta$ and used inside a set comprehension can be obtained as a derived rule:

[TABLE]

These normal forms suggest a limited form of conservativity which nevertheless appears practically useful:

Theorem 2.2.

Let $M$ be a query expression whose variables are all of flat collection type and whose result is a flat collection type, and where $\iota$ and $\delta$ are applied only to flat collections. Let $N$ be a normal form of $M$ : if there are no occurrences of $\iota$ inside multiset comprehensions in $N$ , then $N$ can be translated to SQL.

Let us note that no rewrite rule can move an $\iota$ into a multiset comprehension (this would not be the case if we were to add higher-order functions, however); then, if $M$ has no occurrences of $\iota$ inside bag comprehensions, its normal form also respects this property. We thus know that a sufficient (although not necessary) condition for unnormalized terms to be translatable to SQL is that they should not contain $\iota$ within a bag comprehension: this can be easily enforced by means of a syntactic check.

Examples

An e-commerce company active in several sectors including food and books records transactions independently for each of its departments, by means of tables $\mathit{FoodEvents}$ and $\mathit{BookEvents}$ both with attributes $\mathit{Id}$ and $\mathit{EventType}$ . The same transaction id can appear multiple times in the same table to record different events associated with it (e.g. “paid” or “shipped”), but ids in different tables live in different namespaces, so that, if an id in $\mathit{FoodEvents}$ and one in $\mathit{BookEvents}$ are equal, they still refer to different transactions. A query to collect all the transaction ids of transactions in both departments can be written in $NRC(Set,Bag)$ as follows:

[TABLE]

or equivalently, in SQL:

[TABLE]

The theorem’s side conditions are limiting, in that they do exclude certain queries that are straightforward to translate to SQL. For example, the following query performing a join between a bag query and a set query (using table variables $T$ , $U$ , and $V$ ) employs $\iota$ inside bag comprehension:

[TABLE]

We can however easily express the same operation in SQL:

[TABLE]

Notably, this translation works only because $x$ is not used in the generator for $y$ .

It is currently unclear if there exists a general method to normalize $NRC(Set,Bag)$ queries. We believe the second constraint can be lifted by decorrelating set-valued subqueries, but we do not have insight into how to handle $\iota/\delta$ applied to nested structures. Let us point out, however, that our result does allow the arbitrary nesting of bag and set queries (including the use of $\iota$ ) inside a top level set query, because the normal forms of set queries do not contain $\iota$ .

3. Conclusions

In this short paper we outline initial steps towards conservativity and normalization results that could provide a solid foundation for language-integrated query in the presence of mixed set and bag collections. The preliminary results in this paper provide criteria that ensure that mixed set–multiset queries mapping flat inputs to flat results can be translated to SQL, and which appear to cover many common cases. Our results also elucidate the forms of queries for which this translation is not as straightforward, and resolving their status will be the focus of future work.

Acknowledgments

This work was supported by ERC Consolidator Grant Skye (grant number Grant #682315).

Appendix A Type system

We show here the typing rules for $NRC(Set,Bag)$ , which we omitted from Section 2 due to space constraints: the symbol $\mathbf{B}$ stands for the Boolean type.

$\Gamma\vdash\emptyset:\{{T}\}$

$\Gamma\vdash M:T$

$\Gamma\vdash\{{M}\}:\{{T}\}$

$\Gamma\vdash M:\{{T}\}$ $\Gamma\vdash N:\{{T}\}$

$\Gamma\vdash M\cup N:\{{T}\}$

$\Gamma,x:T\vdash M:\{{S}\}$ $\Gamma\vdash N:\{{T}\}$

$\Gamma\vdash\bigcup\{{M|x\leftarrow N}\}:\{{S}\}$

$\Gamma\vdash\Lbag\Rbag:\Lbag{T}\Rbag$

$\Gamma\vdash M:T$

$\Gamma\vdash\Lbag{M}\Rbag:\Lbag{T}\Rbag$

$\Gamma\vdash M:\Lbag{T}\Rbag$ $\Gamma\vdash N:\Lbag{T}\Rbag$

$\Gamma\vdash M\uplus N:\Lbag{T}\Rbag$

$\Gamma,x:T\vdash M:\Lbag{S}\Rbag$ $\Gamma\vdash N:\Lbag{T}\Rbag$

$\Gamma\vdash\biguplus\Lbag{M|x\leftarrow N}\Rbag:\Lbag{S}\Rbag$

$\Gamma\vdash M:\Lbag{T}\Rbag$

$\Gamma\vdash\delta M:\{{T}\}$

$\Gamma\vdash M:\{{T}\}$

$\Gamma\vdash\iota M:\Lbag{T}\Rbag$

$\Gamma\vdash M:\mathbf{B}$ $\Gamma\vdash N:\{{T}\}$

$\Gamma\vdash\mathrm{where}_{\mathsf{set}}~{}M~{}\mathrm{do}~{}N:\{{T}\}$

$\Gamma\vdash M:\mathbf{B}$ $\Gamma\vdash N:\Lbag{T}\Rbag$

$\Gamma\vdash\mathrm{where}_{\mathsf{bag}}~{}M~{}\mathrm{do}~{}N:\Lbag{T}\Rbag$

Bibliography12

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] P. Buneman, S. Naqvi, V. Tannen, and L. Wong. Principles of programming with complex objects and collection types. Theor. Comput. Sci. , 149(1), 1995.
2[2] J. Cheney, S. Lindley, and P. Wadler. Query shredding: efficient relational evaluation of queries over nested multisets. In SIGMOD , pages 1027–1038. ACM, 2014.
3[3] E. Cooper. The script-writer’s dream: How to write great SQL in your own language, and be sure it will succeed. In DBPL , 2009.
4[4] E. Cooper, S. Lindley, P. Wadler, and J. Yallop. Links: web programming without tiers. In FMCO , 2007.
5[5] L. Fegaras and D. Maier. Optimizing object queries using an effective calculus. ACM Trans. Database Syst. , 25(4):457–516, 2000.
6[6] S. Fehrenbach and J. Cheney. Language-integrated provenance by trace analysis. In DBPL , 2019. To appear.
7[7] S. K. Lellahi and V. Tannen. A calculus for collections and aggregates. In CTCS , pages 261–280, 1997.
8[8] E. Meijer, B. Beckman, and G. M. Bierman. LINQ: reconciling object, relations and XML in the .NET framework. In SIGMOD , 2006.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Mixing set and bag semantics

Abstract.

1. Introduction

2. Language overview

Proposition 2.1.

2.1. SQL queries in NRC(Set,Bag)NRC(Set,Bag)NRC(Set,Bag)

2.2. Normalization

Discussion

Theorem 2.2.

Examples

3. Conclusions

Acknowledgments

Appendix A Type system

2.1. SQL queries in $NRC(Set,Bag)$