
TL;DR
This paper explores the formal foundations of queries that combine set and bag semantics, introducing a calculus, rewrite rules, and translation considerations, highlighting open questions about their expressiveness.
Contribution
It formalizes the problem of mixed set and bag query semantics, proposes a calculus with rewrite rules, and characterizes queries translatable to SQL.
Findings
Introduces a calculus with set and multiset types
Provides rewrite rules for query normalization
Characterizes queries translatable to SQL
Abstract
The conservativity theorem for nested relational calculus implies that query expressions can freely use nesting and unnesting, yet as long as the query result type is a flat relation, these capabilities do not lead to an increase in expressiveness over flat relational queries. Moreover, Wong showed how such queries can be translated to SQL via a constructive rewriting algorithm. While this result holds for queries over either set or multiset semantics, to the best of our knowledge, the questions of conservativity and normalization have not been studied for queries that mix set and bag collections, or provide duplicate-elimination operations such as SQL's . In this paper we formalize the problem, and present partial progress: specifically, we introduce a calculus with both set and multiset collection types, along with natural mappings from sets to bags…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Mixing set and bag semantics
Wilmer Ricciotti
LFCS, School of InformaticsUniversity of Edinburgh
and
James Cheney
LFCS, School of InformaticsUniversity of Edinburgh and The Alan Turing Institute
(2019)
Abstract.
The conservativity theorem for nested relational calculus implies that query expressions can freely use nesting and unnesting, yet as long as the query result type is a flat relation, these capabilities do not lead to an increase in expressiveness over flat relational queries. Moreover, Wong showed how such queries can be translated to SQL via a constructive rewriting algorithm. While this result holds for queries over either set or multiset semantics, to the best of our knowledge, the questions of conservativity and normalization have not been studied for queries that mix set and bag collections, or provide duplicate-elimination operations such as SQL’s {\color[rgb]{0,0,1}\mathtt{SELECT}}~{}{\color[rgb]{0,0,1}\mathtt{DISTINCT}}. In this paper we formalize the problem, and present partial progress: specifically, we introduce a calculus with both set and multiset collection types, along with natural mappings from sets to bags and vice versa, present a set of valid rewrite rules for normalizing such queries, and give an inductive characterization of a set of queries whose normal forms can be translated to SQL. We also consider examples that do not appear straightforward to translate to SQL, illustrating that the relative expressiveness of flat and nested queries with mixed set and multiset semantics remains an open question.
language-integrated query, query normalization
††copyright: acmlicensed††price: 15.00††doi: 10.1145/3315507.3330202††journalyear: 2019††isbn: 978-1-4503-6718-9/19/06††conference: Proceedings of the 17th ACM SIGPLAN International Symposium on Database Programming Languages; June 23, 2019; Phoenix, AZ, USA††booktitle: Proceedings of the 17th ACM SIGPLAN International Symposium on Database Programming Languages (DBPL ’19), June 23, 2019, Phoenix, AZ, USA††ccs: Information systems Structured Query Language††ccs: Software and its engineering Functional languages
1. Introduction
The nested relational calculus (BNTW95, ) provides a principled foundation for integrating database queries into programming languages. Wong’s conservativity theorem (wong96jcss, ) generalized the classic flat-flat theorem (ParedaensG92, ) to show that for any nesting depth , a query expression over flat input tables returning collections of depth at most can be expressed without constructing intermediate results of nesting depth greater than . In the special case , this implies the flat-flat theorem, namely that a nested relational query mapping flat tables to flat tables can be expressed equivalently using the flat relational calculus.
In addition, Wong’s proof technique was constructive, and gave an easily-implemented terminating rewriting algorithm for normalizing NRC queries to equivalent flat queries; these normal forms correspond closely to idiomatic SQL queries and translating from the former to the latter is straightforward. The basic approach has been extended in a number of directions, including to allow for (nonrecursive) higher-order functions in queries (Cooper09, ), and to allow for translating queries that return nested results to a bounded number of flat relational queries (cheney14sigmod, ).
Normalization-based techniques are used in language-integrated query systems such as Kleisli (wong:comprehensions, ) and Links (CLWY06, ). Currently, language-integrated query systems such as C# and F# (meijer:sigmod, ) support duplicate elimination via a {\color[rgb]{0,0,1}\mathtt{DISTINCT}} keyword, which is translated to SQL queries in an ad hoc way, and comes with no guarantees regarding completeness or expressiveness as far as we know, whereas Database-Supported Haskell (DSH) (SIGMOD2015UlrichG, ) supports duplicate elimination but gives all operations list semantics and relies on more sophisticated SQL:1999 features to accomplish this. Fegaras and Maier (DBLP:journals/tods/FegarasM00, ) propose optimization rules for a nested object-relational calculus with set and bag constructs but do not consider the problem of conservativity with respect to flat queries.
Wong’s proof of conservativity also has the nice property that it relies on relatively weak properties of collection types. Thus, it applies both to set and multiset semantics; if we consider nested relational queries over sets, then we can translate to SQL queries using {\color[rgb]{0,0,1}\mathtt{SELECT}}\;{\color[rgb]{0,0,1}\mathtt{DISTINCT}} and {\color[rgb]{0,0,1}\mathtt{UNION}} operations that provide set semantics, while if we consider nested multiset queries we can instead generate plain {\color[rgb]{0,0,1}\mathtt{SELECT}} and {\color[rgb]{0,0,1}\mathtt{UNION}}\;{\color[rgb]{0,0,1}\mathtt{ALL}} operations that do not eliminate duplicates.
SQL itself maintains multiset semantics, but provides several operations that locally employ set semantics, such as {\color[rgb]{0,0,1}\mathtt{SELECT}}\;{\color[rgb]{0,0,1}\mathtt{DISTINCT}} and {\color[rgb]{0,0,1}\mathtt{UNION}}. In a database programming context, it seems natural to consider separate collection types for sets and multisets, so that it is clear from the type of a query expression whether the multiplicity matters. The ability to mix set and multiset queries would be beneficial for an accurate implementation of lineage for Links using the technique proposed by Fehrenbach and Cheney (fehrenbach19, ); however, the consequences of this on the expressiveness of the query language, and the conservativity of nested set/multiset queries over flat ones, do not appear to be well understood. This provides concrete motivation for our work.
In this paper we take some first steps towards conservativity and normal form results for mixed set/multiset queries. We introduce , a straightforward generalization of the nested relational calculus that contains two collection types (sets and bags), other standard constructs, and mappings from sets to bags and vice versa. The mapping from sets to bags simply coerces a set to a bag with the same elements, all with multiplicity 1. The mapping from bags to sets performs duplicate elimination: the set corresponding to a given bag consists of all elements of the bag with multiplicity . We next show that can express conjunctive SQL queries with {\color[rgb]{0,0,1}\mathtt{SELECT}}\;{\color[rgb]{0,0,1}\mathtt{DISTINCT}} and {\color[rgb]{0,0,1}\mathtt{UNION}}, illustrating how idiomatic SQL queries can be written in .
We then explore the equational rewriting opportunities afforded by . We recapitulate the standard rewriting laws of collection types in NRC, which apply both to sets and to bags individually. We also identify natural properties of and , particularly relating them to set and bag operations. The duplicate elimination operation has several convenient properties, because (as shown by Lellahi and Tannen (tannen, )) it is a monad morphism from the multiset to set monads. However, the converse operation has fewer convenient properties. Nevertheless, and do form a Galois connection between the sets and bags over a given type (ordered by the respective inclusion operations). Specifically, this means that calculates in some sense the optimal bag among all those that approximate a given set, and calculates in some sense the optimal set among all those that approximate a given bag. In fact, this Galois connection is a special case called a Galois insertion, which means that it satisfies , that is, if we convert a set to a bag and then eliminate duplicates we get back the original set exactly.
We next discuss the normal forms obtained by applying all possible rewrite rules until no more subexpressions are reducible. (We do not formally explore the termination of this system, but conjecture that it is terminating.) We identify normal forms that can be mapped directly to SQL queries, and give examples for which we do not yet know a systematic translation. Nevertheless, we are able to show a weak conservativity result that is of immediate practical interest: suppose we have queries over flat inputs and returning flat results. If we forbid the use of the operation inside bag comprehensions, then the normal form of any query in this sublanguage is straightforward to translate to SQL.
2. Language overview
We define as follows:
[TABLE]
Types include atomic types, record types with named fields, sets and bags. Terms include applied constants, conditional expressions, records with named fields, and various collection terms (empty, singleton, union, and comprehension). In this definition, ranges over variable names, over constants, and over record field names. Typing rules for collections are largely standard. We will allow ourselves to use sequences of generators in comprehensions, which are syntactic sugar for nested comprehensions, e.g.:
[TABLE]
We assume an intuitive denotational semantics interpreting these expressions as finite sets and bags, satisfying the following valid rules (among others):
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
We can immediately observe the following property about the semantics of and :
Proposition 2.1.
For any type , the operations and form a Galois connection between sets and bags of elements of type , ordered by subset and multiset inclusion orders respectively. That is, . In addition, .
In addition, we can observe the following relationship between the set and multiset operations:
[TABLE]
Together with identities established earlier, this shows that all of the set operations in can be simulated by plus and . This allows us to translate SQL queries to terms with flat bag type.
2.1. SQL queries in
We can show that is sufficiently powerful to express the SQL fragment including {\color[rgb]{0,0,1}\mathtt{SELECT}} [{\color[rgb]{0,0,1}\mathtt{DISTINCT}}]-{\color[rgb]{0,0,1}\mathtt{FROM}}-{\color[rgb]{0,0,1}\mathtt{WHERE}} clauses and {\color[rgb]{0,0,1}\mathtt{UNION}}\;[{\color[rgb]{0,0,1}\mathtt{ALL}}]. Our translation assumes that table names are interpreted as free variables of a suitable bag type; we do not give an explicit translation of SQL terms and conditional expressions, but it is easy to express them as combinations of record field projections and constants. Also notice that our translation assumes that all terms in the {\color[rgb]{0,0,1}\mathtt{SELECT}} clause and all subqueries in the {\color[rgb]{0,0,1}\mathtt{FROM}} clause have been explicitly named using the {\color[rgb]{0,0,1}\mathtt{AS}} keyword.
[TABLE]
{\color[rgb]{0,0,1}\mathtt{SELECT}}\;\star queries can also be expressed in by desugaring them to named {\color[rgb]{0,0,1}\mathtt{SELECT}} queries.
2.2. Normalization
The translation of into SQL relies on the normalization of queries into an SQL-like fragment of the formalism by means of a set of rewrite rules: Fig. 1 shows a selection of the rules (standard rules for set and bag queries are in an appendix). Most of the rules are standard for set and bag queries respectively. Based on the fact that the rewrite rules for set and bag queries, when considered separately, are known to be strongly normalizing and preserve the meaning of expressions, and given that the rules for (where the mixing of sets and bags occurs) do not seem to be problematic, we believe our system to be terminating and to preserve the meaning of expressions; we do not know whether it enjoys confluence, but this property is not required (i.e. we do not require unique normal forms).
Fortuitously, subterms can usually be simplified, and do not block other rules. On the other hand, subterms can block other rewrite rules. This causes two problems. First, even if the result type of a query is flat, it might introduce nested structures internally. For homogeneous set or bag queries, these nested structures can be normalized away, but in mixed set–multiset queries, can block rewrite rules needed to unnest a nested set-valued subquery . We therefore make a simplifying assumption that and are applied only to flat collections (sets or multisets of flat records) to avoid this complication.
Secondly, even with this constraint imposed, the normal form for bag-queries still allows set-queries in several positions. In particular, it is unclear how to unnest set comprehensions within bag comprehensions:
[TABLE]
The normal form for bag-queries must therefore allow normalized set-queries in several positions, particularly in comprehension generators . This implies that in a normalized term such as
[TABLE]
can actually appear free inside and be captured by the first generator. SQL disallows such dependencies between queries in the same {\color[rgb]{0,0,1}\mathtt{FROM}} clause. For example, in the query
[TABLE]
it could be that appears in , but the analogous query
[TABLE]
is not valid SQL if appears in P~{}{\color[rgb]{0,0,1}\mathtt{UNION}}~{}P^{\prime}.
The target fragment of NRC for flat queries with type is defined by the following grammar:
[TABLE]
By a similar reasoning, for multiset queries we can obtain normal forms described by the following grammar:
[TABLE]
Discussion
The normal forms of set queries can be directly translated to equivalent SQL, replacing with {\color[rgb]{0,0,1}\mathtt{UNION}}, comprehensions and (where is a table variable) with {\color[rgb]{0,0,1}\mathtt{SELECT}}~{}{\color[rgb]{0,0,1}\mathtt{DISTINCT}}, and translating NRC record syntax to SQL style. We can see that unnesting of bag comprehension enclosed in a and used inside a set comprehension can be obtained as a derived rule:
[TABLE]
These normal forms suggest a limited form of conservativity which nevertheless appears practically useful:
Theorem 2.2.
Let be a query expression whose variables are all of flat collection type and whose result is a flat collection type, and where and are applied only to flat collections. Let be a normal form of : if there are no occurrences of inside multiset comprehensions in , then can be translated to SQL.
Let us note that no rewrite rule can move an into a multiset comprehension (this would not be the case if we were to add higher-order functions, however); then, if has no occurrences of inside bag comprehensions, its normal form also respects this property. We thus know that a sufficient (although not necessary) condition for unnormalized terms to be translatable to SQL is that they should not contain within a bag comprehension: this can be easily enforced by means of a syntactic check.
Examples
An e-commerce company active in several sectors including food and books records transactions independently for each of its departments, by means of tables and both with attributes and . The same transaction id can appear multiple times in the same table to record different events associated with it (e.g. “paid” or “shipped”), but ids in different tables live in different namespaces, so that, if an id in and one in are equal, they still refer to different transactions. A query to collect all the transaction ids of transactions in both departments can be written in as follows:
[TABLE]
or equivalently, in SQL:
[TABLE]
The theorem’s side conditions are limiting, in that they do exclude certain queries that are straightforward to translate to SQL. For example, the following query performing a join between a bag query and a set query (using table variables , , and ) employs inside bag comprehension:
[TABLE]
We can however easily express the same operation in SQL:
[TABLE]
Notably, this translation works only because is not used in the generator for .
It is currently unclear if there exists a general method to normalize queries. We believe the second constraint can be lifted by decorrelating set-valued subqueries, but we do not have insight into how to handle applied to nested structures. Let us point out, however, that our result does allow the arbitrary nesting of bag and set queries (including the use of ) inside a top level set query, because the normal forms of set queries do not contain .
3. Conclusions
In this short paper we outline initial steps towards conservativity and normalization results that could provide a solid foundation for language-integrated query in the presence of mixed set and bag collections. The preliminary results in this paper provide criteria that ensure that mixed set–multiset queries mapping flat inputs to flat results can be translated to SQL, and which appear to cover many common cases. Our results also elucidate the forms of queries for which this translation is not as straightforward, and resolving their status will be the focus of future work.
Acknowledgments
This work was supported by ERC Consolidator Grant Skye (grant number Grant #682315).
Appendix A Type system
We show here the typing rules for , which we omitted from Section 2 due to space constraints: the symbol stands for the Boolean type.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] P. Buneman, S. Naqvi, V. Tannen, and L. Wong. Principles of programming with complex objects and collection types. Theor. Comput. Sci. , 149(1), 1995.
- 2[2] J. Cheney, S. Lindley, and P. Wadler. Query shredding: efficient relational evaluation of queries over nested multisets. In SIGMOD , pages 1027–1038. ACM, 2014.
- 3[3] E. Cooper. The script-writer’s dream: How to write great SQL in your own language, and be sure it will succeed. In DBPL , 2009.
- 4[4] E. Cooper, S. Lindley, P. Wadler, and J. Yallop. Links: web programming without tiers. In FMCO , 2007.
- 5[5] L. Fegaras and D. Maier. Optimizing object queries using an effective calculus. ACM Trans. Database Syst. , 25(4):457–516, 2000.
- 6[6] S. Fehrenbach and J. Cheney. Language-integrated provenance by trace analysis. In DBPL , 2019. To appear.
- 7[7] S. K. Lellahi and V. Tannen. A calculus for collections and aggregates. In CTCS , pages 261–280, 1997.
- 8[8] E. Meijer, B. Beckman, and G. M. Bierman. LINQ: reconciling object, relations and XML in the .NET framework. In SIGMOD , 2006.
