Functional Aggregate Queries with Additive Inequalities
Mahmoud Abo Khamis, Ryan R. Curtin, Benjamin Moseley, Hung Q. Ngo, XuanLong Nguyen, Dan Olteanu, Maximilian Schleich

TL;DR
This paper introduces new algorithms and width parameters for efficiently answering functional aggregate queries with additive inequalities, with applications to machine learning tasks, improving over existing solutions.
Contribution
It defines relaxed width parameters and algorithms for FAQ-AI, extending prior work and enabling faster solutions for complex database queries with inequalities.
Findings
New width parameters for FAQ-AI with additive inequalities.
Algorithms achieving lower complexity than previous methods.
Applications to machine learning tasks like clustering and SVM training.
Mahmoud Abo Khamis
relationalAI
Ryan R. Curtin
relationalAI
Benjamin Moseley
Carnegie Mellon University
Hung Q. Ngo
relationalAI
XuanLong Nguyen
University of Michigan
Dan Olteanu
University of Zurich
Maximilian Schleich
University of Washington
Abstract
Motivated by fundamental applications in databases and relational machine learning, we formulate and study the problem of answering functional aggregate queries (FAQ) in which some of the input factors are defined by a collection of additive inequalities between variables. We refer to these queries as FAQ-AI for short.
To answer FAQ-AI in the Boolean semiring, we define relaxed tree decompositions and relaxed submodular and fractional hypertree width parameters. We show that an extension of the InsideOut algorithm using Chazelle’s geometric data structure for solving the semigroup range search problem can answer Boolean FAQ-AI in time given by these new width parameters. This new algorithm achieves lower complexity than known solutions for FAQ-AI. It also recovers some known results in database query answering.
Our second contribution is a relaxation of the set of polymatroids that gives rise to the counting version of the submodular width, denoted by #subw. This new width is sandwiched between the submodular and the fractional hypertree widths. Any FAQ and FAQ-AI over one semiring can be answered in time proportional to #subw and respectively to the relaxed version of #subw.
We present three applications of our FAQ-AI framework to relational machine learning: $k$-means clustering, training linear support vector machines, and training models using non-polynomial loss. These optimization problems can be solved over a database asymptotically faster than computing the join of the database relations.
1 Introduction
In this article we consider the problem of computing functional aggregate queries with additive inequalities, or FAQ-AI queries for short. Although existing algorithms such as InsideOut [6, 5] and PANDA [9, 8] are able to evaluate FAQ-AI queries, they do not exploit the structure of the additive inequalities. We introduce variants of these algorithms to this effect. Whereas the prior algorithms work on hypertree decompositions of the queries, our new algorithms work on relaxations of these decompositions to achieve lower computational complexities than InsideOut and PANDA.
Functional aggregate queries with additive inequalities can express computation needed for various database workloads and supervised and unsupervised machine learning.
On the database side, queries with inequalities occur naturally in scenarios involving temporal and spatial relationships between objects in databases. In a retail scenario (e.g., TPC-H), we would like to compute the revenue generated by a customer’s orders whose dates closely precede the ship dates of their lineitems. In streaming scenarios, we would like to detect patterns of events whose time stamps follow a particular order [16]. In spatial data management scenarios, we would like to retrieve objects whose coordinates are within a multi-dimensional range or in close proximity of other objects [31]. The evaluation of Core XPath queries over XML documents amounts to the evaluation of conjunctive queries with inequalities expressing tree relationships in the pre/post plane [20].
For machine learning, we show that FAQ-AI can express computation needed for $k$-means clustering, training linear support vector machines, and training models using non-polynomial loss. These optimization problems can be solved over a database asymptotically faster than computing the join of the database relations.
1.1 Motivating examples
A key insight of this article is that the efficient computation of inequality joins can reduce the computational complexity of supervised and unsupervised machine learning.
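Example 1.1.
The $k$-means algorithm divides the input dataset $G$ into $k$ clusters of similar data points [24]. Each cluster $G_i$ has a mean $\mu_i \in \mathbb{R}^n$, which is chosen according to the following optimization (similarity is defined here with respect to the $\ell_2$ norm):
$$\min_{(G_1, \ldots, G_k)} \sum_{i \in [k]} \sum_{x \in G_i} \|x - \mu_i\|_2^2.$$
Let $\mu_{i,d}$ be the $d$'th component of mean vector $\mu_i$. For a data point $x \in G$, the function $c_{ij}(x)$ computes the difference between the squares of the $\ell_2$-distances from $x$ to $\mu_i$ and from $x$ to $\mu_j$:
$$c_{ij}(x) = \|x - \mu_i\|_2^2 - \|x - \mu_j\|_2^2 = \sum_{d \in [n]} \left[\mu_{i,d}^2 - 2 x_d (\mu_{i,d} - \mu_{j,d}) - \mu_{j,d}^2\right].$$
A data point $x \in G$ is closest to mean $\mu_i$ from the set of $k$ means iff $c_{ij}(x) \le 0$ for all $j \in [k]$.
To compute the mean vector $\mu_i$, we need to compute the sum of values for each dimension $d \in [n]$ over $G_i$. If the dataset $G$ is the join of database relations $(R_p)_{p \in [m]}$ over schemas $X_p$, we can formulate this sum computation as a datalog-like query with aggregates [21]:
$$Q_{i,d}(\mathrm{SUM}(x_d)) \leftarrow \bigwedge_{p \in [m]} R_p(x_p) \wedge \bigwedge_{j \in [k]} c_{ij}(x) \le 0.$$
The above notation means that the answer to query $Q_{i,d}$ is the sum of $x_d$ over all tuples $x$ satisfying the conjunction on the right-hand side. Section 4 gives further queries necessary to compute the $k$-means. As we show in this article, such queries with aggregates and inequalities can be computed asymptotically faster than the join defining $G$. ∎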
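As a rough illustration of the quantities in this example, the following Python sketch evaluates the squared-distance difference between two means and accumulates per-dimension sums over the points that are closest to a chosen mean. The function names `c` and `cluster_sums` and the toy data are hypothetical. The sketch scans an already materialized point set, which is exactly what the techniques of this article aim to avoid, so it only shows what is being computed, not how to compute it efficiently.

```python
from typing import List, Sequence, Tuple

def c(x: Sequence[float], mu_i: Sequence[float], mu_j: Sequence[float]) -> float:
    """Squared l2-distance from x to mu_i minus squared l2-distance from x to mu_j."""
    return sum(mi * mi - 2 * xd * (mi - mj) - mj * mj
               for xd, mi, mj in zip(x, mu_i, mu_j))

def cluster_sums(points: List[Sequence[float]],
                 means: List[Sequence[float]],
                 i: int) -> Tuple[List[float], int]:
    """Per-dimension sums and count over points for which means[i] is a closest mean."""
    n = len(means[i])
    sums = [0.0] * n
    count = 0
    for x in points:
        # x belongs to cluster i iff c(x, mu_i, mu_j) <= 0 for every mean mu_j.
        if all(c(x, means[i], mu_j) <= 0 for mu_j in means):
            count += 1
            for d in range(n):
                sums[d] += x[d]
    return sums, count

# Hypothetical toy data: two well-separated groups in the plane.
points = [(0.0, 0.0), (1.0, 0.0), (9.0, 9.0), (10.0, 10.0)]
means = [(0.0, 0.0), (10.0, 10.0)]
sums0, count0 = cluster_sums(points, means, 0)
```

Dividing each entry of `sums0` by `count0` would give the updated mean of the first cluster.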
Simple queries can already highlight the limitations of state-of-the-art evaluation techniques, as shown next.
Example 1.2.
State-of-the-art techniques take time to compute the following query over relations of size :
[TABLE]
Examples 3.10 and 3.20 show how to compute and its counting version in time using the techniques introduced in this article.∎
1.2 The FAQ-AI problem
One way to answer the above queries is to view them as functional aggregate queries (FAQ) [6] formulated in sum-product form over some semiring. We therefore briefly introduce FAQ over a single semiring.
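We first establish notation. For any positive integer $n$, let $[n] = \{1, \ldots, n\}$. For $i \in [n]$, let $X_i$ denote a variable/attribute, and $x_i$ denote a value in the discrete domain $\mathsf{Dom}(X_i)$ of $X_i$. For any $S \subseteq [n]$, define $X_S = (X_i)_{i \in S}$, $x_S = (x_i)_{i \in S} \in \prod_{i \in S} \mathsf{Dom}(X_i)$. That is, $X_S$ is a tuple of variables and $x_S$ is a tuple of values for these variables.
Consider a semiring $(D, \oplus, \otimes, \mathbf{0}, \mathbf{1})$. Let $\mathcal{H} = (\mathcal{V}, \mathcal{E})$ be a multi-hypergraph, which means that $\mathcal{V} = [n]$ is a set of vertices and $\mathcal{E}$ is a multiset of edges where each edge $K \in \mathcal{E}$ is a subset of $\mathcal{V}$. (A multiset is a collection of elements each of which can occur multiple times.) To each edge $K \in \mathcal{E}$ we associate a function $R_K : \prod_{v \in K} \mathsf{Dom}(X_v) \to D$ called a factor. An FAQ query $Q$ over one semiring with free variables $F \subseteq \mathcal{V}$ has the form:
$$Q(x_F) = \bigoplus_{x_{[n] \setminus F}} \bigotimes_{K \in \mathcal{E}} R_K(x_K). \tag{2}$$
Under the Boolean semiring $(\{\text{true}, \text{false}\}, \vee, \wedge, \text{false}, \text{true})$, the query (2) becomes a conjunctive query: The factors $R_K$ represent input relations, where $R_K(x_K) = \text{true}$ iff $x_K \in R_K$, with some notational overloading. Under the sum-product semiring, the query (2) counts the number of tuples in the join result for each tuple $x_F$, where the factors $R_K$ are indicator functions $R_K(x_K) = \mathbf{1}_{x_K \in R_K}$. (The notation $\mathbf{1}_E$ denotes the indicator function of the event $E$ in the semiring $(D, \oplus, \otimes, \mathbf{0}, \mathbf{1})$: $\mathbf{1}_E = \mathbf{1}$ if $E$ holds, and $\mathbf{0}$ otherwise.) To aggregate over some input variable, say $X_k$, we can designate an identity factor $R_k(x_k) = x_k$.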
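To make the sum-product form concrete, here is a minimal brute-force evaluator, written as a sketch under the assumption that all domains are small finite lists. The name `eval_faq` and the toy relations `R` and `S` are hypothetical. The same routine yields a conjunctive-query answer under the Boolean semiring and a join count under the sum-product semiring, simply by swapping the two operations and their identities.

```python
import operator
from itertools import product

def eval_faq(domains, factors, free, add, mul, zero, one):
    """Brute-force FAQ: for each assignment to the free variables, combine with
    `add` over all assignments to the remaining variables the `mul`-product of
    all factors.

    domains: dict mapping variable name -> list of values
    factors: list of (scope, function) where function takes a tuple of values
    free:    tuple of free variable names
    """
    bound = [v for v in domains if v not in free]
    result = {}
    for free_vals in product(*(domains[v] for v in free)):
        assignment = dict(zip(free, free_vals))
        total = zero
        for bound_vals in product(*(domains[v] for v in bound)):
            assignment.update(zip(bound, bound_vals))
            term = one
            for scope, f in factors:
                term = mul(term, f(tuple(assignment[v] for v in scope)))
            total = add(total, term)
        result[free_vals] = total
    return result

# Hypothetical toy instance: the join of R(a, b) and S(b, c), free variable a.
R = {(1, 2), (1, 3), (2, 3)}
S = {(2, 5), (3, 5), (3, 6)}
domains = {"a": [1, 2, 3], "b": [2, 3], "c": [5, 6]}

# Sum-product semiring: factors are 0/1 indicators, the answer counts join tuples.
counting_factors = [(("a", "b"), lambda t: int(t in R)),
                    (("b", "c"), lambda t: int(t in S))]
counts = eval_faq(domains, counting_factors, ("a",), operator.add, operator.mul, 0, 1)

# Boolean semiring: factors are membership tests, the answer is a conjunctive query.
boolean_factors = [(("a", "b"), lambda t: t in R),
                   (("b", "c"), lambda t: t in S)]
exists = eval_faq(domains, boolean_factors, ("a",),
                  lambda x, y: x or y, lambda x, y: x and y, False, True)
```

This evaluator takes time proportional to the product of all domain sizes; it is only meant to pin down the semantics, not to reflect the width-based runtimes discussed in the text.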
Throughout the article, we assume the query size to be a constant and state runtimes in data complexity. It is known [6] that over an arbitrary semiring, the query (2) can be answered in time , where is the size of the largest relation , fhtw denotes the fractional hypertree width of the query, and has no free variables [19]. If has free variables, fhtw-width becomes FAQ-width instead [6]. Here is the size of the largest factor . Over the Boolean semiring, the time can be lowered to [9], where subw is the submodular width [32] and hides a polylogarithmic factor in .
Motivated by the examples in Section 1.1, we formulate a class of FAQ queries called FAQ-AI:
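Definition 1.3 (FAQ-AI).
Given a hyperedge multiset $\mathcal{E}$ that is partitioned into two multisets $\mathcal{E} = \mathcal{E}_s \cup \mathcal{E}_\ell$, where $s$ stands for “skeleton” and $\ell$ stands for “ligament”, the input to a query from the FAQ-AI class is the following:
1. To each hyperedge $K \in \mathcal{E}_s$, there corresponds a function $R_K$, as in the FAQ case.
2. To each hyperedge $S \in \mathcal{E}_\ell$, there correspond $|S|$ functions $\theta^S_v$, one for every variable $v \in S$.
The output to the FAQ-AI query is the following:
$$Q(x_F) = \bigoplus_{x_{[n] \setminus F}} \left(\bigotimes_{K \in \mathcal{E}_s} R_K(x_K)\right) \otimes \left(\bigotimes_{S \in \mathcal{E}_\ell} \mathbf{1}_{\sum_{v \in S} \theta^S_v(x_v) \le 0}\right). \tag{3}$$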
The summation is over tuples . The (uni-variate) functions can be user-defined functions, e.g., , or binary predicates with one key in and a numeric value, e.g., a table salary(employee_id, salary_value) where employee_id is a key. The only requirement we impose is that, given , the value can be accessed/computed in -time (in data complexity).
If $\mathcal{E}_\ell = \emptyset$, then we get back the FAQ formulation (2).
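The definition can be read operationally as follows: skeleton factors restrict tuples by membership in input relations, and each ligament contributes an indicator that a sum of univariate functions is at most zero. The sketch below counts satisfying assignments by brute force in the sum-product semiring. The name `count_faq_ai`, the dictionary encoding of a ligament, and the toy data are all illustrative assumptions.

```python
from itertools import product

def count_faq_ai(domains, skeleton, ligaments):
    """Count assignments that lie in every skeleton relation and satisfy every
    additive inequality.

    skeleton:  list of (scope, set_of_tuples)
    ligaments: list of dicts mapping variable -> univariate function; the
               inequality is sum(theta(x[v]) for v, theta in dict) <= 0
    """
    variables = list(domains)
    count = 0
    for values in product(*(domains[v] for v in variables)):
        x = dict(zip(variables, values))
        in_skeleton = all(tuple(x[v] for v in scope) in rel for scope, rel in skeleton)
        in_ligaments = all(sum(theta(x[v]) for v, theta in lig.items()) <= 0
                           for lig in ligaments)
        if in_skeleton and in_ligaments:
            count += 1
    return count

# Hypothetical instance: R(a, b) joined with S(b, c) under the inequality a - c <= 0.
R = {(1, 2), (4, 2), (3, 7)}
S = {(2, 3), (7, 1)}
domains = {"a": [1, 3, 4], "b": [2, 7], "c": [1, 3]}
skeleton = [(("a", "b"), R), (("b", "c"), S)]
ligaments = [{"a": lambda a: a, "c": lambda c: -c}]

with_inequality = count_faq_ai(domains, skeleton, ligaments)
without_inequality = count_faq_ai(domains, skeleton, [])
```

With an empty ligament list the routine degenerates to plain join counting, mirroring the remark that an empty ligament multiset gives back an ordinary FAQ.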
Example 1.4.
The queries in Section 1.1 are instances of (3):
[TABLE]
Note that for a given , can be computed in -time in data complexity, which in this context means when the number of dimensions is a constant. is over the sum-product semiring. can be over any semiring: Example 3.10 discusses the case of the Boolean semiring while Example 3.20 discusses the sum-product semiring. ∎
1.3 Our contributions
To answer FAQ queries of the form (2), currently there are two dominant width parameters: fractional hypertree width (fhtw [19]) and submodular width (subw [32]); Section 2.1 overviews other notions of widths. It is known that $\mathsf{subw} \le \mathsf{fhtw}$ for any query, and in the Boolean semiring we can answer (2) in $\tilde{O}(N^{\mathsf{subw}})$-time [9, 32]. For non-Boolean semirings, the best known algorithm, called InsideOut [6, 7], evaluates (2) in time $\tilde{O}(N^{\mathsf{fhtw}})$. For queries with free variables, fhtw is replaced by the more general notion of FAQ-width (faqw) [6]; however, for brevity we discuss the non-free variable case here.
Following [7], both width parameters subw and fhtw can be defined via two constraint sets: the first is the set TD of all tree decompositions of the query hypergraph , and the second is the set of polymatroids on vertices of . The widths subw and fhtw are then defined as maximin and respectively minimax optimization problems on the domain pair TD and , subject to “edge domination” constraints for . Section 2 presents these notions and other related preliminary concepts in detail.
Our contributions include the following:
Answering FAQ-AI over Boolean semiring
On the Boolean semiring, one way to answer query (3) is to apply the PANDA algorithm [32], using edge domination constraints on and the set TD of all tree decompositions of . However, we can do better. In Section 3.2 we define a new notion of tree decomposition: relaxed tree decomposition, in which the hyperedges in only have to be covered by adjacent TD bags. Then, we present a variant of the InsideOut algorithm running on these relaxed TDs using Chazelle’s classic geometric data structure [13] for solving the semigroup range search problem. We show that our InsideOut variant meets the “relaxed fhtw” runtime, which is the analog of fhtw on relaxed TD. The PANDA algorithm can use the InsideOut variant as a blackbox to meet the “relaxed subw” runtime. The relaxed widths are smaller than the non-relaxed counterparts, and are strictly smaller for some classes of queries, which means our algorithms yield asymptotic improvements over existing ones.
Answering FAQ over an arbitrary semiring
Next, to prepare the stage for answering FAQ-AI over an arbitrary semiring, in Section 3.3 we revisit FAQ over a non-Boolean semiring, where no known algorithm can achieve the subw-runtime. Here, we relax the set of polymatroids to a superset of relaxed polymatroids. Then, by adapting the subw definition to relaxed polymatroids, we obtain a new width parameter called “sharp submodular width” (#subw). We show how a variant of PANDA, called #PANDA, can achieve the #subw-runtime for evaluating FAQ over an arbitrary semiring. We prove that $\mathsf{subw} \le \#\mathsf{subw} \le \mathsf{fhtw}$, and that there are classes of queries for which #subw is unboundedly smaller than fhtw.
Answering FAQ-AI over an arbitrary semiring
Getting back to FAQ-AI, we apply the #subw result under both relaxations: relaxed TD and relaxed polymatroids, to obtain a new width parameter called the relaxed #subw. We show that the new variants of PANDA and InsideOut can achieve the relaxed #subw runtime. We also show that there are queries for which relaxed #subw is essentially the best we can hope for, modulo $k$-sum-hardness.
Applications to relational Machine Learning
Equipped with the algorithms for answering FAQ-AI, in Section 4 we return to relational machine learning applications over training datasets defined by feature extraction queries over relational databases. We show how one can train linear SVM, $k$-means, and ML models using Huber/hinge loss functions without completely materializing the output of the feature extraction queries. In particular, this shows that for these important classes of ML models, one can sometimes train models in time sub-linear in the size of the training dataset.
An early version of this work appeared in the proceedings of the 38th ACM Symposium on Principles of Database Systems (PODS’19) [1]. This article goes beyond that early version by extending the class of loss functions supported by our framework for relational machine learning, introducing new applications for our framework on the (probabilistic) database side, and including detailed proofs and derivation steps for various key results.
1.4 Related work
Appendix A revisits two prior results on the evaluation of queries with inequalities through FAQ-AI lenses: Core XPath queries over XML documents [18] and inequality joins over tuple-independent probabilistic databases [36].
Throughout the article, we contrast our new width notions with fhtw and subw and our new algorithm #PANDA with the state-of-the-art algorithms PANDA and InsideOut for FAQ and FAQ-AI queries.
Prior seminal work considers the containment and minimization problem for queries with inequalities [27]. The efficient evaluation of such queries continues to receive attention in the database community [26]. There is a large body of work on queries with disequalities (not-equal), which are at times referred to as inequalities. Queries with disequalities are a proper subclass of FAQ-AI (since $x \neq y$ can be represented as $x < y \vee x > y$). Prior works [28, 4] present several results for this proper subclass that are stronger than our general results for FAQ-AI in this work. In particular, for queries with disequalities it suffices to consider tree decompositions only for “skeleton” edges (ignoring “ligament” edges which, in this case, are the disequalities) [28, 4], whereas for the more general FAQ-AI we need to consider “relaxed” tree decompositions (see Def. 3.3).
Section 4 reviews relevant works on machine learning.
2 Preliminaries
We assume without loss of generality that semiring operations $\oplus$ and $\otimes$ can be performed in $O(1)$-time. (When the assumption does not hold, for the set semiring for instance, we can multiply the claimed runtime with the real operation’s runtime.)
2.1 Tree decompositions and polymatroids
We briefly define tree decompositions, fhtw and subw parameters. We refer the reader to the recent survey by Gottlob et al. [17] for more details and historical contexts. In what follows, the hypergraph should be thought of as the hypergraph of the input query, although the notions of tree decomposition and width parameters are defined independently of queries.
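A tree decomposition of a hypergraph $\mathcal{H} = (\mathcal{V}, \mathcal{E})$ is a pair $(T, \chi)$, where $T$ is a tree whose nodes are $V(T)$ and $\chi : V(T) \to 2^{\mathcal{V}}$ maps each node $t$ of the tree to a subset $\chi(t)$ of vertices such that
1. every hyperedge $S \in \mathcal{E}$ is a subset of some $\chi(t)$, $t \in V(T)$ (i.e., every edge is covered by some bag),
2. for every vertex $v \in \mathcal{V}$, the set $\{t \mid v \in \chi(t)\}$ is a non-empty (connected) sub-tree of $T$. This is called the running intersection property.
The sets $\chi(t)$ are called the bags of the tree decomposition.
Let $\mathsf{TD}(\mathcal{H})$ denote the set of all tree decompositions of $\mathcal{H}$. When $\mathcal{H}$ is clear from context, we use TD for brevity.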
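The two conditions of a tree decomposition can be checked mechanically. The following sketch does so for small inputs; the function names and the encoding (bags as a dictionary from tree node to vertex set, tree edges as a list of node pairs that is assumed to already form a tree) are illustrative choices, not notation from the article.

```python
def induces_connected_subtree(nodes, tree_edges):
    """True if `nodes` is non-empty and connected using only tree edges inside it."""
    if not nodes:
        return False
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        t = stack.pop()
        for a, b in tree_edges:
            for s, u in ((a, b), (b, a)):
                if s == t and u in nodes and u not in seen:
                    seen.add(u)
                    stack.append(u)
    return seen == nodes

def is_tree_decomposition(vertices, hyperedges, bags, tree_edges):
    # Condition 1: every hyperedge is contained in some bag.
    for edge in hyperedges:
        if not any(set(edge) <= bag for bag in bags.values()):
            return False
    # Condition 2 (running intersection): the tree nodes whose bags contain a
    # given vertex form a non-empty connected subtree.
    for v in vertices:
        holders = {t for t, bag in bags.items() if v in bag}
        if not induces_connected_subtree(holders, tree_edges):
            return False
    return True

# Hypothetical path hypergraph with edges {a,b}, {b,c}, {c,d}.
vertices = ["a", "b", "c", "d"]
hyperedges = [("a", "b"), ("b", "c"), ("c", "d")]
good_bags = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d"}}
bad_bags = {1: {"a", "b"}, 2: {"c", "d"}, 3: {"b", "c"}}
tree_edges = [(1, 2), (2, 3)]

good = is_tree_decomposition(vertices, hyperedges, good_bags, tree_edges)
bad = is_tree_decomposition(vertices, hyperedges, bad_bags, tree_edges)
```

In `bad_bags` every hyperedge is still covered, but the vertex `b` appears in the bags of nodes 1 and 3 and not in the bag of node 2 between them, so the running intersection property fails.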
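To define width parameters, we use the polymatroid characterization from Abo Khamis et al. [9]. A function $f : 2^{\mathcal{V}} \to \mathbb{R}_+$ is called a (non-negative) set function on $\mathcal{V}$. A set function $f$ on $\mathcal{V}$ is modular if $f(S) = \sum_{v \in S} f(\{v\})$ for all $S \subseteq \mathcal{V}$, monotone if $f(X) \le f(Y)$ whenever $X \subseteq Y$, and submodular if $f(X \cup Y) + f(X \cap Y) \le f(X) + f(Y)$ for all $X, Y \subseteq \mathcal{V}$. A monotone, submodular set function $h$ with $h(\emptyset) = 0$ is called a polymatroid. Let $\Gamma_n$ denote the set of all polymatroids on $\mathcal{V} = [n]$.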
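For a small ground set, the polymatroid conditions (value zero on the empty set, monotonicity, submodularity) can be verified exhaustively. The sketch below does this by enumerating all pairs of subsets; the helper names and the two example set functions are hypothetical, and the exponential enumeration is only sensible for a handful of vertices.

```python
from itertools import combinations

def all_subsets(ground):
    """Yield every subset of `ground` as a frozenset."""
    items = sorted(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def is_polymatroid(ground, h):
    """Check h(empty) = 0, monotonicity, and submodularity by brute force."""
    subsets = list(all_subsets(ground))
    if h(frozenset()) != 0:
        return False
    for x in subsets:
        for y in subsets:
            if x <= y and h(x) > h(y):
                return False  # violates monotonicity
            if h(x | y) + h(x & y) > h(x) + h(y):
                return False  # violates submodularity
    return True

ground = {"a", "b", "c"}
capped = is_polymatroid(ground, lambda s: min(len(s), 2))  # rank of a uniform matroid
squared = is_polymatroid(ground, lambda s: len(s) ** 2)    # monotone but not submodular
```

The capped-cardinality function passes all three checks, while the squared cardinality fails submodularity already on two disjoint singletons.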
Given , define the set of edge dominated set functions:
[TABLE]
We next define the submodular width and fractional hypertree width of a given hypergraph :
[TABLE]
It is known [32] that $\mathsf{subw}(\mathcal{H}) \le \mathsf{fhtw}(\mathcal{H})$, and there are classes of hypergraphs with bounded subw and unbounded fhtw. Furthermore, fhtw is strictly less than other width notions such as (generalized) hypertree width and tree width.
Remark 2.1.
Prior to Abo Khamis et al. [9], the commonly used definition of is [19]
[TABLE]
where is the fractional edge cover number of a vertex set using the hyperedge set . It is straightforward to show, using linear programming duality [9], that
[TABLE]
proving the equivalence of the two definitions. However, the characterization (6) has two primary advantages: (i) it exposes the minimax / maximin duality between fhtw and subw, and more importantly (ii) it makes it completely straightforward to relax the definitions by replacing the constraints by other applicable constraints, as shall be shown in later sections.∎
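Definition 2.2 ($F$-connex tree decomposition).
Given a hypergraph $\mathcal{H} = (\mathcal{V}, \mathcal{E})$ and a set $F \subseteq \mathcal{V}$, a tree decomposition $(T, \chi)$ of $\mathcal{H}$ is $F$-connex if there is a subset $V' \subseteq V(T)$ that forms a connected subtree of $T$ and satisfies $\bigcup_{t \in V'} \chi(t) = F$. (Note that $V'$ could be empty.)
We use $\mathsf{TD}_F(\mathcal{H})$ to denote the set of all $F$-connex tree decompositions of $\mathcal{H}$. (Note that when $F = \emptyset$, $\mathsf{TD}_F(\mathcal{H}) = \mathsf{TD}(\mathcal{H})$.)
Definition 2.3 (Non-redundant tree decomposition).
A tree decomposition $(T, \chi)$ is redundant if there are $s \neq t \in V(T)$ where $\chi(s) \subseteq \chi(t)$. A tree decomposition is non-redundant if it is not redundant.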
The following proposition is folklore. For completeness, we prove it in Appendix B.
Proposition 2.4.
For every tree decomposition of a query , there exists a non-redundant tree decomposition of that satisfies
[TABLE]
Moreover, if is -connex, then can be chosen to be -connex as well.
Based on the above proposition, we only need to consider non-redundant tree decompositions in (6) and (7) (and later on in (10) and (13)).
2.2 InsideOut and PANDA
To answer the FAQ query (2), we need a model for the representation of the input factors . The support of the function is the set of tuples such that . We use to denote the size of its support. For example, if represents an input relation, then is the number of tuples in . In practice, there often are factors with infinite support, e.g., represents a built-in function in a database, an arithmetic operator, or a comparison operator as in (3). To deal with this more general setting, the edge set is partitioned into two sets , where is finite for all and for all . For simplicity, we often state runtimes of algorithms in terms of the “input size” . Moreover, we use to denote the output size of . We always assume that ; otherwise the output size could be infinite.
InsideOut [6, 5, 7]
To answer (2), the InsideOut algorithm works by eliminating variables, along with an idea called the “indicator projection” (see Appendix C for more details). The runtime is described by the FAQ-width of the query, a slight generalization of fhtw. For one semiring, we can define by applying Definition (6) over a restricted set of tree decompositions and edge dominated polymatroids. In particular, let denote the set of free variables in (2), and recall from Definition 2.2. Then,
[TABLE]
Note that when and (i.e. ). A simple result from Abo Khamis et al. [6] is the following: (Recall that throughout the article we assume the query size to be a constant and state runtimes in data complexity.)
Theorem 2.5 ([6, 5]).
InsideOut answers query (2) in time .
A proof sketch of the above theorem can be found in Appendix C. To solve the FAQ-AI (3), we can apply Theorem 2.5 with since all ligament factors are infinite. But this is suboptimal—later, we show a new InsideOut variant that is polynomially better.
PANDA [9, 8]
For the Boolean semiring, i.e., when the FAQ query (2) is of the form
[TABLE]
we can do much better than Theorem 2.5. When , Marx [32] showed that (12) can be answered in time . The PANDA algorithm [9, 8] generalizes Marx’s result to deal with general degree constraints, and to meet precisely the -runtime (see Appendix D for more details). In fact, PANDA works with queries such as (12) with free variables as well. In the context of this article, we can define the following notion of submodular FAQ-width in a natural way:
[TABLE]
Then, the results from Abo Khamis et al. [9] imply:
Theorem 2.6 ([9, 8]).
PANDA answers query (12) in time .
Appendix D presents an overview of the core PANDA algorithm and its analysis. The PANDA results only work for the Boolean semiring. Section 3 introduces a variant of PANDA, called #PANDA, that also works for non-Boolean semirings.
2.3 Semigroup range searching
Orthogonal range counting (and searching) is a classic and ubiquitous problem in computational geometry [15]: given a set of $N$ points in a $d$-dimensional space, build a data structure that, given any $d$-dimensional rectangle, can efficiently return the number of enclosed points. More generally, there is the semigroup range searching problem [13], where each point $p$ of the $N$ input points also has a weight $w(p) \in G$, where $(G, \oplus)$ is a semigroup. (In a semigroup we can add two elements using $\oplus$, but there is no additive inverse.) The problem is: given a $d$-dimensional rectangle $B$, compute $\bigoplus_{p \in B} w(p)$.
Classic results by Chazelle [13] show that there are data structures for semigroup range searching which can be constructed in time , and answer rectangular queries in -time. Also, this is almost the best we can hope for [14]. There are more recent improvements to Chazelle’s result (see, e.g., Chan et al. [12]), but they are minor (at most a factor), as the original results were already very close to matching the lower bound.
Most of these range search/counting problems can be reduced to the dominance range searching problem (on semigroups), where the query is represented by a point $q$, and the objective is to return $\bigoplus_{p \,:\, q \preceq p} w(p)$. Here, $\preceq$ denotes the “dominance” relation (coordinate-wise $\le$). We can think of $q$ as the lower-corner of an infinite rectangle query.
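To give a feel for dominance range searching, the sketch below answers a batch of two-dimensional dominance queries offline with a sweep over the first coordinate and a Fenwick tree over the second. This is a standard textbook technique and not Chazelle's data structure; the names `Fenwick` and `dominance_sums` are hypothetical, and integer addition stands in for the semigroup operation. The tree is indexed by reversed rank so that only additions are used and no inverse is ever needed.

```python
from bisect import bisect_left

class Fenwick:
    """Prefix sums under addition; positions are 0-based."""
    def __init__(self, size):
        self.tree = [0] * (size + 1)

    def add(self, pos, weight):
        pos += 1
        while pos < len(self.tree):
            self.tree[pos] += weight
            pos += pos & -pos

    def prefix(self, end):
        """Sum of the weights stored at positions 0 .. end-1."""
        total = 0
        while end > 0:
            total += self.tree[end]
            end -= end & -end
        return total

def dominance_sums(points, weights, queries):
    """For each query q, sum the weights of points p with q[0] <= p[0] and q[1] <= p[1]."""
    ys = sorted({p[1] for p in points})
    fen = Fenwick(len(ys))
    by_x = sorted(range(len(points)), key=lambda i: -points[i][0])
    query_order = sorted(range(len(queries)), key=lambda i: -queries[i][0])
    answers = [0] * len(queries)
    k = 0
    for qi in query_order:
        qx, qy = queries[qi]
        # Insert every point whose first coordinate is at least qx.
        while k < len(by_x) and points[by_x[k]][0] >= qx:
            i = by_x[k]
            rank = bisect_left(ys, points[i][1])
            fen.add(len(ys) - 1 - rank, weights[i])  # reversed rank: larger y first
            k += 1
        # Inserted points with y >= qy occupy the first len(ys) - j reversed ranks.
        j = bisect_left(ys, qy)
        answers[qi] = fen.prefix(len(ys) - j)
    return answers

points = [(1, 1), (2, 3), (3, 2), (0, 5)]
weights = [1, 10, 100, 1000]
queries = [(1, 1), (2, 2), (0, 4), (4, 0)]
answers = dominance_sums(points, weights, queries)
```

Each point and query costs a logarithmic number of tree operations after sorting. Higher dimensions need heavier machinery such as range trees, which is where the polylogarithmic factors in the cited bounds come from.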
3 Relaxed tree decompositions and relaxed polymatroids
3.1 Connection to semigroup range searching
We always assume that ; otherwise the output size could be infinite. We start with a special case of (3) in which the skeleton part contains only two hyperedges and . Consider the aggregate query of the form
[TABLE]
where and are two input functions/relations over variable sets and , respectively. We prove the following simple but important lemma:
Lemma 3.1.
Let , and . For , query (14) can be answered in time .
Proof.
If there is a hyperedge for which , then in a -time pre-processing step we can “absorb” the factor into the factor , by replacing with the product . In particular, this product can be computed by iterating over tuples satisfying and for each such tuple , testing whether the inequality holds. If it does, then the indicator takes a value of , hence the value of remains unchanged after the product. Otherwise, both the indicator and its product with take a value of . A similar absorption can be done with . Hence, without loss of generality we can assume that and for all .
Moreover, we only need to show that we can compute (14) for , because after is computed, we can “aggregate away” variables in -time by computing the aggregation:
[TABLE]
The above aggregation can be computed by sorting tuples satisfying lexicographically based on so that tuples sharing the same -prefix become consecutive. Then for each distinct -prefix, we aggregate away over all tuples sharing that prefix.
Abusing notation somewhat, for each and each , define the function by
[TABLE]
Fix a tuple such that . A tuple is said to be -adjacent if . We show how to compute the following sum in poly-logarithmic time:
[TABLE]
where the inner sum ranges only over tuples which are -adjacent. This is because the value of has been fixed and tuples that are not -adjacent are inconsistent with the fixed value of .
Now, for the fixed and for each define the following -dimensional points:
[TABLE]
We write to say that is dominated by coordinate-wise: . Assign to each point a “weight” of . Now, taking (17),
[TABLE]
(The equality used above follows from the definition of the component-wise .) The expression thus computes, for a given “query point” , the weighted sum over all points that dominate the query point. This is precisely the dominance range counting problem, which—modulo a -preprocessing step—can be solved in time [13], as reviewed in Section 2.3.
∎
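The simplest instance of the idea in this proof has two relations sharing a join variable and a single additive inequality across them. In that case the "range search" is one-dimensional and reduces to sorting plus binary search. The sketch below counts matching pairs this way; the name `count_join_with_inequality`, the binary schemas, and the toy data are assumptions made for illustration, and the sketch covers only this one-inequality case rather than the general lemma.

```python
from bisect import bisect_right
from collections import defaultdict

def count_join_with_inequality(R, S, f, g):
    """Count pairs (a, j) in R and (j, b) in S with f(a) + g(b) <= 0.

    Groups R by the shared value j, sorts each group by f(a), and then answers
    each S-tuple with one binary search instead of scanning the whole group.
    """
    groups = defaultdict(list)
    for a, j in R:
        groups[j].append(f(a))
    for values in groups.values():
        values.sort()
    count = 0
    for j, b in S:
        values = groups.get(j)
        if values:
            # f(a) + g(b) <= 0  iff  f(a) <= -g(b)
            count += bisect_right(values, -g(b))
    return count

# Hypothetical data: R(a, j) and S(j, b) with the inequality a + b <= 0.
R = [(1, "x"), (5, "x"), (2, "y")]
S = [("x", -3), ("x", -10), ("y", 0), ("z", -100)]
fast = count_join_with_inequality(R, S, lambda a: a, lambda b: b)
```

The work is dominated by sorting, so the count is obtained without enumerating the join itself, whose size can be quadratic in the input.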
Example 3.2.
Let be a binary relation. Suppose we want to count the number of tuples satisfying . By setting , , , the problem can be reduced to the form (14) with , . We can thus compute this count in time .∎
3.2 Relaxed tree decompositions
Equipped with this basic case, we can now proceed to solve the general setting of (3). To this end, we define a new width parameter.
Definition 3.3 (Relaxed tree decomposition).
Let denote a multi-hypergraph whose edge multiset is partitioned into and . A relaxed tree decomposition of (with respect to the partition ) is a pair , where is a tree whose nodes and edges are and respectively, and satisfies the following properties:
- (a)
The running intersection property holds: for each node the set is a connected subtree in .
- (b)
Every “skeleton” edge is covered by some bag , .
- (c)
Every “ligament” edge is covered by the union of two adjacent bags and , i.e. , where .
Let denote the set of all relaxed tree decompositions of (with respect to the skeleton-ligament partition). When is clear from context we use for the sake of brevity. Given , let denote the set of all relaxed -connex tree decompositions of .
The new condition (c) in the above definition is needed so that later we can utilize Lemma 3.1 to compute aggregate queries over the relaxed tree decomposition. In particular, the two adjacent bags and in condition (c) will play the role of and from Lemma 3.1 and the corresponding query (14).
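Conditions (a) to (c) can be checked directly on small inputs, as in the following sketch. The function names and the encoding of bags and tree edges are hypothetical. One detail is an assumption of this sketch rather than part of the definition: a ligament edge that already fits in a single bag is accepted, which also handles a tree with only one node.

```python
def induces_connected_subtree(nodes, tree_edges):
    """True if `nodes` is non-empty and connected using only tree edges inside it."""
    if not nodes:
        return False
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        t = stack.pop()
        for a, b in tree_edges:
            for s, u in ((a, b), (b, a)):
                if s == t and u in nodes and u not in seen:
                    seen.add(u)
                    stack.append(u)
    return seen == nodes

def is_relaxed_td(vertices, skeleton, ligaments, bags, tree_edges):
    # (a) Running intersection property.
    for v in vertices:
        holders = {t for t, bag in bags.items() if v in bag}
        if not induces_connected_subtree(holders, tree_edges):
            return False
    # (b) Every skeleton edge lies inside a single bag.
    for edge in skeleton:
        if not any(set(edge) <= bag for bag in bags.values()):
            return False
    # (c) Every ligament edge lies inside the union of two adjacent bags.
    for edge in ligaments:
        in_pair = any(set(edge) <= bags[s] | bags[t] for s, t in tree_edges)
        in_single = any(set(edge) <= bag for bag in bags.values())  # sketch assumption
        if not (in_pair or in_single):
            return False
    return True

# Hypothetical query: skeleton edges {a,b}, {b,c} and one ligament edge {a,c}.
bags = {1: {"a", "b"}, 2: {"b", "c"}}
tree_edges = [(1, 2)]
relaxed_ok = is_relaxed_td("abc", [("a", "b"), ("b", "c")], [("a", "c")], bags, tree_edges)
# Treating the ligament as a skeleton edge fails: no single bag contains {a, c}.
strict_ok = is_relaxed_td("abc", [("a", "b"), ("b", "c"), ("a", "c")], [], bags, tree_edges)

# A ligament spanning non-adjacent bags is rejected.
path_bags = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d"}}
path_edges = [(1, 2), (2, 3)]
far_ok = is_relaxed_td("abcd", [("a", "b"), ("b", "c"), ("c", "d")], [("a", "d")],
                       path_bags, path_edges)
```

The first instance shows the gain from relaxation: bags of two variables suffice, whereas a non-relaxed decomposition would need a bag containing all three variables.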
3.2.1 FAQ-AI on a general semiring
We use relaxed TDs in conjunction with Lemma 3.1 to answer FAQ-AI with a relaxed notion of faqw. In particular, the relaxed width parameters of are defined in exactly the same way as the usual width parameters defined in Section 2, except we allow the TDs to range over relaxed ones.
Definition 3.4 (Relaxed faqw).
Let be an FAQ-AI query (3), and be its hypergraph. Furthermore, let denote the set of hyperedges for which . Then, the relaxed FAQ-width of is defined by
[TABLE]
When , collapses to which is the relaxed fhtw for FAQ-AI without free variables:
[TABLE]
A relaxed tree decomposition of is optimal if its width is equal to , i.e.,
[TABLE]
Theorem 3.5.
Any FAQ-AI query of the form (3) on any semiring can be answered in time , where is the maximum number of additive inequalities covered by a pair of adjacent bags in an optimal relaxed tree decomposition. (Note that can be a lot smaller than since different additive inequalities can be covered by different pairs of adjacent bags in an optimal relaxed hypertree decomposition.)
Proof.
We first consider the case of no free variables because this case captures the key idea. Fix an optimal relaxed tree decomposition . We first compute, for each bag of the tree decomposition, a factor such that
[TABLE]
To define the factors , we need the notion of indicator projection [7, 5, 6]; see Appendix C for some background about the InsideOut algorithm where this notion was originally developed.
Definition 3.6 (Indicator Projection [6, 5, 7]).
For a given and such that , the indicator projection of onto the set is a function defined by
[TABLE]
Based on the above definition, it is easy to verify that for any and such that , we have the identity
[TABLE]
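As an informal illustration of the indicator projection, the sketch below projects the support of a relation onto the variables it shares with a target set and returns a 0/1 function on the shared variables. The reading used here (the projection is 1 exactly when the partial tuple extends to a tuple in the support) is an assumption based on the cited InsideOut work, and the function name and data are hypothetical.

```python
def indicator_projection(schema, relation, target):
    """Return the shared variables and a 0/1 function that is 1 exactly on the
    projections of tuples of `relation` onto those shared variables."""
    shared = [v for v in schema if v in target]
    positions = [schema.index(v) for v in shared]
    support = {tuple(row[i] for i in positions) for row in relation}
    return shared, (lambda values: 1 if tuple(values) in support else 0)

# Hypothetical relation R(a, b) projected onto a bag over the variables {b, c}.
schema = ("a", "b")
relation = {(1, 2), (1, 3), (2, 3)}
shared, pi = indicator_projection(schema, relation, {"b", "c"})

# Multiplying a factor by its own indicator projection leaves it unchanged:
# every tuple in the support projects to a point where the indicator is 1.
unchanged = all(pi((row[schema.index("b")],)) == 1 for row in relation)
```

This is why such projections can be added to any bag for free: they never remove a tuple that contributes to the answer, but they can shrink intermediate results.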
Recall from Definition 3.3 that every is covered by at least one bag for . Fix an arbitrary coverage assignment , where is covered by the bag . Then, the factors are defined by:
[TABLE]
Claim 1.
The factors defined by (26) satisfy (23).
The above claim can be proved as follows:
[TABLE]
For every , the query can be reduced to a join query and solved using a worst-case optimal join algorithm [34, 35, 43] as follows. For every where , define to be the support of the factor , which is the set of tuples satisfying :
[TABLE]
can be viewed as a relation over variables . Computing can be reduced to solving the join query defined as:
[TABLE]
This is because once we solve the join query , the factor can be computed as follows:
[TABLE]
where above denotes the output of the join query . The join query can be computed using a worst-case optimal join algorithm in time
[TABLE]
Over all , our runtime is bounded by , where
[TABLE]
Moreover for every , the output size of the join query is bounded by , thanks to the AGM bound [10, 19].
Next we compute (23) in time . We will make use of the fact that is a relaxed TD. Fix an arbitrary root of the tree decomposition ; following InsideOut (Appendix C), we compute (23) by eliminating variables from the leaves of up to the root. Thanks to Proposition 2.4, we can assume the tree decomposition to be non-redundant. Let be any leaf of , be its parent, where and . Because of non-redundancy, we have . Now write (23) as follows:
[TABLE]
The third equality uses the semiring’s distributive law. (Note that implies that thanks to Definition 3.3 and the fact that is the only neighbor of .) Lemma 3.1 implies that we can compute the sub-query from (32) in the allotted time. The above step eliminates all variables in . In particular after this step, the original query from (23) becomes:
[TABLE]
The above is an FAQ of the same form (3) except that it no longer involves the variables (Recall that ). It admits a tree decomposition that results from the original tree decomposition by removing the leaf . In particular, the new factor in (33) is covered by the bag and all other properties of tree decompositions continue to hold after the removal of . By induction on the number of variables, we solve the new query (33) in time . Induction completes the proof. (In the base case, we have a query with no variables where the theorem holds trivially.)
When the query has free variables, the algorithm proceeds similarly to the case of an FAQ with free variables [6, 5]. See Appendix C for a recap of how to handle free variables in an FAQ. ∎
Example 3.7.
Given three binary relations and , consider a query that counts the number of tuples that satisfy:
[TABLE]
The query has and . Let . Note that . In fact, any of the previously known algorithms, e.g. [6, 7], would take time to answer . However, this query has , and by Theorem 3.5, it can be answered in time . (Note that here .) An optimal relaxed tree decomposition is shown in Figure 1. ∎
We next give a couple of simple lower and upper bounds for . The upper bound shows that, effectively is the best we can hope for, if the FAQ-AI query is arbitrary. The lower bound shows that, while the relaxed tree decomposition idea can improve the runtime by a polynomial factor, it cannot improve the runtime over straightforwardly applying InsideOut (over non-relaxed tree decompositions) by more than a polynomial factor.
Proposition 3.8.
For any positive integer , there exists an FAQ-AI query of the form (3) for which , and it cannot be answered in time , modulo -sum hardness.
Proof.
It is widely assumed [37, 30] that is the best runtime for -sum, which is the following problem: given number sets of maximum size , determine whether there is a tuple such that . We can reduce -sum to our problem: Consider the query over the Boolean semiring:
[TABLE]
The answer to is true iff there is a tuple such that . The reduction shows that our query (35) is -sum-hard. For this query, .
∎
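To make the reduction concrete, the following minimal sketch evaluates a Boolean query of the kind used in the proof by brute force. It assumes that the target condition (the sum of the chosen numbers equals zero) is encoded by two additive inequalities, sum ≤ 0 and −sum ≤ 0; the function name `ksum_as_boolean_query` is hypothetical. The sketch only illustrates the encoding and deliberately makes no attempt at an efficient evaluation.

```python
from itertools import product

def ksum_as_boolean_query(sets):
    """Boolean query: is there (x_1, ..., x_k) in S_1 x ... x S_k such that
    both additive inequalities  sum(x) <= 0  and  -sum(x) <= 0  hold?
    Brute force over the Cartesian product; this is NOT an efficient algorithm."""
    for xs in product(*sets):
        total = sum(xs)
        # The conjunction of the two inequalities is exactly "total == 0".
        if total <= 0 and -total <= 0:
            return True
    return False
```

For instance, calling the function on the three sets {1, 2}, {-3, 5}, {1, 7} finds the witness 2 + (-3) + 1 = 0.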
Proposition 3.9.
For any FAQ-AI query of the form (3), we have ; in particular, when has no free variables .
Proof.
Let denote a relaxed tree decomposition of with fractional hypertree width . Construct a new (non-relaxed) tree decomposition for as follows. Each vertex in is also a vertex in with . Moreover, to each edge there corresponds an additional vertex in whose bag is . As for the edge set of , for each edge , there are two corresponding edges in , namely and . We can verify that is a (non-relaxed) tree decomposition of . Moreover because each bag of is covered by at most two bags of , the FAQ-width of is at most . Finally, if is -connex, then so is . ∎
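The construction in this proof can be sketched in a few lines. The sketch below assumes a tree decomposition is given as a dictionary from tree nodes to bags (sets of variables) plus a list of tree edges; the names `unrelax` and `covered` are hypothetical. Each tree edge is subdivided by a fresh node whose bag is the union of the two endpoint bags, so any hyperedge that was only covered by the union of two adjacent bags becomes covered by a single bag.

```python
def unrelax(bags, tree_edges):
    """bags: dict node -> set of variables; tree_edges: list of (s, t) pairs.
    Returns (new_bags, new_edges), where every original edge (s, t) is
    subdivided by a new node whose bag is bags[s] | bags[t]."""
    new_bags = {node: set(bag) for node, bag in bags.items()}
    new_edges = []
    for s, t in tree_edges:
        mid = ("edge", s, t)                      # fresh node for this tree edge
        new_bags[mid] = set(bags[s]) | set(bags[t])
        new_edges.append((s, mid))
        new_edges.append((mid, t))
    return new_bags, new_edges

def covered(hyperedge, bags):
    """True iff some single bag contains the whole hyperedge."""
    return any(set(hyperedge) <= bag for bag in bags.values())
```

On a two-bag example with bags {a, b} and {b, c}, an edge {a, c} is not covered by either bag, but it is covered by the new middle bag {a, b, c} after the construction.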
3.2.2 FAQ-AI on the Boolean semiring
Before explaining how we can adapt PANDA to solve an FAQ-AI query on the Boolean semiring, we give the intuition with an example.
Example 3.10.
Consider the following FAQ-AI:
[TABLE]
Here . Using the fractional hypertree width measure and InsideOut (even with relaxed TDs and Theorem 4), the best runtime is , because no matter which (relaxed) TD we choose, the worst-case bag relation size is . However the PANDA framework [9, 8] can solve many queries, including this one, in time smaller than the FAQ-width. At a very high level, the way PANDA achieves this is by carefully partitioning the input data and then choosing a possibly different tree decomposition for each part. Query (36) accepts two non-redundant and non-trivial relaxed tree decompositions. (A tree decomposition is trivial if it consists of only one bag containing all the variables.) The first tree decomposition consists of the bags and while the second has the bags and . The PANDA framework utilizes both tree decompositions simultaneously to solve this query. In particular, for each tuple satisfying the body of query (36), we make sure that this tuple is “captured by” at least one of the two tree decompositions in the sense that it will be reported by a query over this tree decomposition. We realize this intuition using the following disjunctive Datalog rule:
[TABLE]
In the above rule, there are two relations in the head and , and they form a solution to the rule iff the following holds: if satisfies the body, then either or . Via information-theoretic inequalities [9, 8], we are able to show that PANDA can compute a solution to the above disjunctive Datalog rule in time . In particular, both and are bounded by .
Given such a solution to (37) (which is not necessarily unique), it is straightforward to verify that the following also holds, using the distributivity of over :
[TABLE]
By semijoin-reducing against and (i.e. by replacing with ), and similarly by semijoin-reducing against and , we conclude that
[TABLE]
Finally, we have a rewrite of the original body:
[TABLE]
The above captures precisely our intuition that every tuple satisfying the body of (36) should be reported by either one of the two relaxed tree decompositions. By defining intermediate rules, we can compute from them:
[TABLE]
and are of the form (14), and thus they each can be answered in -time (since ). This implies that can be answered in -time overall.∎
The strategy outlined in the above example uses PANDA to evaluate an FAQ-AI query over the Boolean semiring. The resulting algorithm achieves a natural generalization of the submodular FAQ-width defined in (13):
Definition 3.11.
Given an FAQ-AI query (3) over the Boolean semiring, the relaxed submodular FAQ-width of is defined by
[TABLE]
(Recall that the set of relaxed tree decompositions was defined in Definition 3.3.)
Theorem 3.12.
Any FAQ-AI query of the form (3) on the Boolean semiring can be answered in time .
Proof.
As in the proof of Theorem 4, we first assume there are no free variables; the extension to is a straightforward generalization of techniques developed in [6, 5] and reviewed in Appendix C. When , the query (3) is written in Datalog as:
[TABLE]
We write instead of and instead of to avoid clutter. It will be implicit throughout this proof that the subscript of a factor/function indicates its arguments. To answer query (44), the first step is to find one relation (over variables ) for every bag of every relaxed tree decomposition such that the relations together form a solution to the following equation:
[TABLE]
Note that the right-hand side of (45) is a Boolean tensor decomposition of the left-hand side: In particular under the Boolean semiring , the left-hand side of (45) can be viewed as an -dimensional tensor where while the right-hand side is an equivalent sum of a product of tensors. The idea of using Boolean tensor decomposition to speed up query evaluation was used in the context of queries with disequalities [4]. Assuming that we can compute the intermediate relations efficiently satisfying (45), then (44) can be answered by answering for each an intermediate query:
[TABLE]
The final answer is obtained by the Datalog rule:
[TABLE]
The key point here is that each intermediate query (46) is an FAQ-AI query (3) with . (We can also show here that is exactly 1 although this is not needed for the proof of Theorem 3.12. In particular, by comparing (20) to (6), we can see that for any query , , and fhtw for any hypergraph is at least [19].) This is because admits a relaxed tree decomposition where each bag for is covered by one relation , hence . By Theorem 4 each intermediate query (46) can be answered in time where
[TABLE]
It remains to show how to compute tables that form a solution to (45); to do so, we apply distributivity of over to rewrite the right-hand side of (45) as follows. Let be the collection of all maps such that for some ; in other words, selects one bag out of each tree decomposition . Then, from the distributive law we have
[TABLE]
which means to solve the relational equation (45) we can instead solve the equation
[TABLE]
To solve the above equation, for each we can find tables that form a solution to the following equation
[TABLE]
To do that, for each , we compute a solution to the following disjunctive Datalog rule:
[TABLE]
Once we obtain the relations , we can semijoin-reduce them against the input relations (i.e. replace with for each input relation where ), in order to obtain that solve (50). Once we obtain those , we plug them in (46) to obtain an FAQ-AI query of the form (3) for each relaxed tree decomposition . We use Theorem 4 to solve each one of those queries in time where was given by (48). This is the step of the algorithm where the additive inequalities participate in the computation. Once we obtain the solutions to queries (46), we use (47) to obtain the answer of the original FAQ-AI query.
The only step in the above algorithm that we haven’t specified yet is how to evaluate each disjunctive Datalog rule (52). We do so by running the PANDA algorithm, which computes the rule in time bounded by , where
[TABLE]
Maximizing over , the runtime is bounded by , where
[TABLE]
The first equality in (57) follows from the minimax lemma in [8]. Our reasoning above also shows that from (48) is bounded by . ∎
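The rewriting step in the proof above rests on the distributive law: a disjunction over tree decompositions of the conjunction of each decomposition's bags equals a conjunction, over all maps that select one bag per tree decomposition, of the disjunction of the selected bags. The following sketch checks this identity exhaustively on a small instance. The bag names and the helper functions are hypothetical, and each bag relation is abstracted to a single truth value for one fixed tuple.

```python
from itertools import product

def dnf_over_tds(tds, truth):
    """OR over tree decompositions of the AND over that decomposition's bags."""
    return any(all(truth[bag] for bag in td) for td in tds)

def cnf_over_choices(tds, truth):
    """AND over all choice maps (one bag picked from each tree decomposition)
    of the OR over the picked bags."""
    return all(any(truth[bag] for bag in choice) for choice in product(*tds))

# Two hypothetical tree decompositions with two bags each.
tds = [("ABC", "ACD"), ("BCD", "ABD")]
bags = sorted({bag for td in tds for bag in td})
agree = all(
    dnf_over_tds(tds, dict(zip(bags, values)))
    == cnf_over_choices(tds, dict(zip(bags, values)))
    for values in product([False, True], repeat=len(bags))
)
```

Each conjunct on the CNF side corresponds to one disjunctive Datalog rule, which is why the rules can be solved independently of one another.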
3.3 Relaxed polymatroids
A key step in the proof of Theorem 3.12 is to find the Boolean tensor decomposition (45) of the product over . In a non-Boolean semiring, this becomes a tensor decomposition on this semiring:
[TABLE]
In order to compute this tensor decomposition, we can still follow the script of the proof of Theorem 3.12, working on the parameter space of the input factors ; however, for the equality in (58) to hold (it is an identity over the value-space of the factors), it suffices to ensure the following property:
For any s.t. , there is exactly one tree decomposition for which
[TABLE]
while for the other TDs, the left-hand side above is .
Essentially, the property ensures that we do not have to perform inclusion-exclusion (IE) over the tree decompositions in . (IE is difficult for two reasons: (1) IE computation explodes the runtime, and (2) in a general semiring there may not be additive inverses and thus IE may not even apply.) We do not know how to ensure this property in general. However, under a relaxed notion of polymatroids, the property above holds. Since this idea applies to FAQ queries in general, we start with our result on FAQ queries first, before specializing it to FAQ-AI.
3.3.1 FAQ over an arbitrary semiring
To explain how we can guarantee the property (59) for an FAQ query over an arbitrary semiring, consider the following example. Suppose that we would like to evaluate the (aggregate) query
[TABLE]
We write instead of for short. The factors are functions of two variables , and they are represented by ternary relations in a database. Abusing notation we will also use to refer to its support, i.e., the binary relation over such that iff .
There are only two non-trivial tree decompositions for the “-cycle” query (60): one with bags and , and the other with bags and . (The trivial TD with one bag can always be replaced by a non-trivial TD in the considered bounds/algorithms without making them any worse. Similarly, redundant TDs can be replaced by non-redundant ones.) To evaluate the query, we first solve the relational equation (58), but only on the supports; i.e., we would like to find relations , , and such that
[TABLE]
The second is due to the distributivity of over . Since the last formula is in CNF, we can solve each term separately by solving different disjunctive Datalog rules:
[TABLE]
Applying the proof-to-algorithm conversion idea from PANDA [9, 8], the above disjunctive Datalog rules can be solved with the PANDA algorithm. It is beyond the scope of this article to describe the PANDA algorithm in full detail. However, we can describe a solution. Let . For each input relation/factor, define their “light” parts as follows.
[TABLE]
Also, for every , define . Then, one can verify that the following is a solution to the relational equations (62)-(65):
[TABLE]
The above is not yet a solution to (61). However we can refine it as follows to obtain such a solution:
[TABLE]
(These extra relations that are joined into to turn them into a solution to (61) will be referred to as “filters” in the proof of Theorem 3.15 below.) It is straightforward to verify that each can be computed in -time. However, (61) alone is not enough to guarantee (59). Instead, we now need to satisfy the following stronger condition (where $\bigvee^{\scriptscriptstyle +}$ denotes the exclusive OR):
[TABLE]
Luckily, in this particular example, our previous solution for from (67) happens to be a solution to (68) as well. Once we have the relations from (67), we can extend them naturally into factors (so that they are represented by -ary relations) satisfying (59). In particular, as functions with range , they are defined by
[TABLE]
Finally the query from (60) can be computed by taking the sum of two queries:
[TABLE]
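The idea of splitting the tuples so that each one is counted by exactly one tree decomposition can be illustrated with a small sketch. It assumes, as a simplification, that the cycle query has the form Σ R(a,b)·S(b,c)·T(c,d)·U(d,a) with 0/1-valued factors given as sets of pairs, and it uses a degree threshold of ⌊√N⌋ on the variables b and d. These heavy/light definitions are our own illustrative choice and need not coincide with the ones in (66) and (67). The sketch demonstrates the exactly-once property of the partition; it does not materialize bags in a way that attains the stated runtime.

```python
from collections import defaultdict
from math import isqrt

def index_by_first(rel):
    idx = defaultdict(list)
    for x, y in rel:
        idx[x].append(y)
    return idx

def count_via_td(E1, E2, E3, E4):
    """Count (w, x, y, z) with E1(w,x), E2(x,y), E3(y,z), E4(z,w) using the
    two bags {w,x,y} and {y,z,w}: eliminate x and z, then join on (w, y)."""
    e2, e4 = index_by_first(E2), index_by_first(E4)
    left = defaultdict(int)                      # (w, y) -> number of x
    for w, x in E1:
        for y in e2.get(x, ()):
            left[(w, y)] += 1
    right = defaultdict(int)                     # (w, y) -> number of z
    for y, z in E3:
        for w in e4.get(z, ()):
            right[(w, y)] += 1
    return sum(n * right.get(key, 0) for key, n in left.items())

def count_4cycles_partitioned(R, S, T, U):
    theta = isqrt(max(len(R), len(S), len(T), len(U), 1))
    deg_b, deg_d = index_by_first(S), index_by_first(U)
    S_l = [(b, c) for b, c in S if len(deg_b[b]) <= theta]   # b light
    S_h = [(b, c) for b, c in S if len(deg_b[b]) > theta]    # b heavy
    U_l = [(d, a) for d, a in U if len(deg_d[d]) <= theta]   # d light
    U_h = [(d, a) for d, a in U if len(deg_d[d]) > theta]    # d heavy
    part1 = count_via_td(R, S_l, T, U_l)   # bags {a,b,c},{c,d,a}: b, d light
    part2 = count_via_td(S_h, T, U, R)     # bags {b,c,d},{d,a,b}: b heavy
    part3 = count_via_td(S_l, T, U_h, R)   # same bags: b light, d heavy
    return part1 + part2 + part3

def count_4cycles_bruteforce(R, S, T, U):
    return sum(1 for a, b in R for b2, c in S if b2 == b
               for c2, d in T if c2 == c if (d, a) in U)
```

The three cases (b and d light; b heavy; b light and d heavy) are mutually exclusive and exhaustive, so the three partial counts can simply be added, with no inclusion-exclusion correction.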
The above sketch does not work for a general FAQ query because the relational solution returned by PANDA is not guaranteed to satisfy (59). (If we could do that, then we would have been able to solve queries in submodular width time, but the latter is unlikely to be possible since the submodular width tightly characterizes the hardness of CSP queries [32].) We could however restrict PANDA forcing it to maintain (59) at the cost of weakening the runtime bound achieved by PANDA. In particular, PANDA’s runtime is upperbounded by the submodular (FAQ) width, which is a maximum over some set of polymatroids (See Section 2.2). We will now replace these polymatroids with a superset, called -polymatroids, leading to a larger version of the submodular (FAQ) width called “sharp submodular (FAQ) width”. The latter captures the runtime of our new version of PANDA, called #PANDA.
Definition 3.13 (-polymatroids and ).
Given a collection of subsets of , a set function is said to be a -polymatroid if it satisfies the following: (i) , (ii) whenever , and (iii) for every pair such that for some . (The restriction on the pair in (iii) is the only distinction between -polymatroids and polymatroids. If we drop it, we get back the original definition of polymatroids.) In particular, a -polymatroid is a polymatroid as defined in Section 2.1. For , let denote the set of all -polymatroids on .
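As a sanity check on the definition, the sketch below tests the three axioms by brute force on a small ground set. It assumes that condition (iii) requires submodularity only for pairs X, Y whose intersection is contained in some member of the given collection; the function names and the example set function are hypothetical. The example shows a function that passes the relaxed test for the collection {{a}} but violates ordinary submodularity, which is consistent with the relaxed class being a superset of the polymatroids.

```python
from itertools import combinations

def subsets(V):
    items = sorted(V)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def is_relaxed_polymatroid(h, V, E=None, tol=1e-9):
    """h: dict frozenset -> number. With E=None, check the ordinary polymatroid
    axioms; otherwise require submodularity only for pairs X, Y whose
    intersection is contained in some S in E (assumed reading of (iii))."""
    all_sets = subsets(V)
    if abs(h[frozenset()]) > tol:                        # (i) h(empty) = 0
        return False
    for X in all_sets:
        for Y in all_sets:
            if X <= Y and h[X] > h[Y] + tol:             # (ii) monotonicity
                return False
            required = E is None or any((X & Y) <= frozenset(S) for S in E)
            if required and h[X | Y] + h[X & Y] > h[X] + h[Y] + tol:
                return False                             # (iii) submodularity
    return True

V = {"a", "b", "c"}
h = {frozenset(): 0, frozenset("a"): 1, frozenset("b"): 1.5, frozenset("c"): 1,
     frozenset("ab"): 2, frozenset("ac"): 2, frozenset("bc"): 2,
     frozenset("abc"): 3}
relaxed_ok = is_relaxed_polymatroid(h, V, E=[{"a"}])
ordinary_ok = is_relaxed_polymatroid(h, V)
```

Here the pair X = {a, b}, Y = {b, c} violates submodularity, but its intersection {b} is not contained in {a}, so the relaxed test does not examine it.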
The following definition is a straightforward generalization of smfw from (13), where we replace by the relaxed polymatroids .
Definition 3.14 (#-submodular FAQ-width).
Given an FAQ query (2) whose hypergraph is , its #-submodular FAQ-width, denoted by , is defined by
[TABLE]
When there are no free variables, i.e., , we define , to mirror the case when .
Under the above new width parameter, we can now maintain condition (59) allowing us to solve FAQ queries over any semiring:
Theorem 3.15.
Any FAQ query of the form (2) on any semiring can be answered in time .
The proof of Theorem 3.15 involves an appropriate adaptation of PANDA called #PANDA, to be described below. Appendix D presents an overview of the original PANDA algorithm. Readers unfamiliar with PANDA are encouraged to read that appendix before the following proof.
Proof.
The PANDA algorithm [9, 8] takes as input a disjunctive Datalog query of the form
[TABLE]
The above query has an input relation for each hyperedge in the query’s hypergraph . The output to the above query is a collection of tables , one for each “goal” (or “target”) in the collection of goals . The output tables must satisfy the logical implication in (72): In particular, for each tuple that satisfies the conjunction , the disjunction must hold. Query (37) is an example of (72). A disjunctive Datalog query (72) can have many valid outputs. The PANDA algorithm computes one such output in time , where
[TABLE]
(Recall notation from Section 2.2.)
In what follows, we describe a variant of PANDA, called #PANDA, that takes a disjunctive Datalog query (72), and computes the following:
- A collection of tables that form a valid output to query (72), i.e. that satisfy the logical implication in (72).
- Moreover, associated with each output table , #PANDA additionally computes a collection of “filter” tables , one table for each hyperedge in the input hypergraph . The output tables along with the associated filters satisfy the following condition: For each tuple that satisfies the conjunction , there is exactly one target where the conjunction holds, and for that target , holds as well. In particular, the following equivalences hold:
[TABLE]
where $\bigvee^{\scriptscriptstyle +}$ above denotes the exclusive OR. Equations (74) and (75) together imply
[TABLE]
Comparing the above to (72), note that the purpose of the filters is to keep the goals disjoint from one another allowing us to replace with $\bigvee^{\scriptscriptstyle +}$ and ultimately maintain condition (58). (As we will see later, in #PANDA, we start with filters that are identical to the corresponding input relations , and we keep removing tuples from to maintain (74) and (75) throughout the algorithm.)
#PANDA computes the above output tables and in time where
[TABLE]
Now we briefly explain how to tweak the PANDA algorithm into #PANDA satisfying the above characteristics. We refer the reader to Appendix D and [9, 8] for more details about PANDA. At a high level, the PANDA algorithm starts with proving an exact upperbound on from (73) using a sequence of proof steps, called the proof sequence (see Lemmas 135, D.3, and D.5). Then PANDA interprets each step in the proof sequence as a relational operator, and then uses this sequence of relational operators as a query plan to actually compute the query in time . One of the proof steps used in PANDA is the decomposition step for some . The relational operator corresponding to this decomposition step is the “partitioning” operator, in which we take an input (or intermediate) table and partition it into a small number of tables , based on the degrees of variables in with respect to variables in . In particular, define the degree of w.r.t. a tuple and w.r.t. as follows:
[TABLE]
In the partitioning step, we partition tuples into buckets based on and partition accordingly. Specifically, for each , we define
[TABLE]
After partitioning, PANDA creates independent branches of the problem, where in the -th branch, is replaced by both and . Note that for each , the following holds:
[TABLE]
The above inequality mirrors the proof step exemplifying the way the entire PANDA algorithm mirrors the proof sequence of the bound in (73) allowing its runtime to be bounded by (73) (see [9, 8] for more details). After each partitioning step, PANDA continues on each one of the branches of the problem independently and ends up computing a potentially different target for some within each branch.
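The partitioning operator can be sketched as follows. The sketch assumes a table over exactly the variables X ∪ Y, stored as a set of tuples, and it assumes power-of-two degree buckets (bucket j collects the X-values whose degree lies in [2^j, 2^(j+1))). The function name and the bucketing constants are our own illustrative choices and may differ from the exact definitions in the displayed equations. Within each bucket, the number of distinct X-values times the maximum degree is at most twice the size of the original table, which is the kind of size/degree trade-off that mirrors the decomposition proof step.

```python
from collections import defaultdict

def partition_by_degree(T, x_pos):
    """T: set of tuples over X ∪ Y; x_pos: column positions that form X.
    Returns dict j -> list of tuples whose X-value has degree in
    [2**j, 2**(j+1)), where the degree of an X-value is the number of
    tuples of T that agree with it on the X columns."""
    groups = defaultdict(list)
    for t in T:
        groups[tuple(t[i] for i in x_pos)].append(t)
    buckets = defaultdict(list)
    for tuples in groups.values():
        j = len(tuples).bit_length() - 1          # floor(log2(degree))
        buckets[j].extend(tuples)
    return dict(buckets)

def bucket_stats(bucket, x_pos):
    """Return (number of distinct X-values, maximum degree) within a bucket."""
    deg = defaultdict(int)
    for t in bucket:
        deg[tuple(t[i] for i in x_pos)] += 1
    return len(deg), max(deg.values())
```

The number of buckets is logarithmic in the table size, so branching on them adds only a logarithmic factor.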
From the proof sequence construction described in [9, 8], we note the following: If the constructed proof sequence that is used to prove the bound on in (73) contains a decomposition step , then the proof of the bound on must have relied on some submodularity constraint on of the form for some where . In particular, such a submodularity can be broken down into the sum of two inequalities:
[TABLE]
which in turn are converted into two proof steps in the proof sequence:
[TABLE]
Moreover, the above is the only place in the proof sequence construction [9, 8] where a decomposition step (85) is introduced. However, the new bound (77) used in #PANDA only relies on submodularities where for some . (Recall from Definition 3.13.) Therefore, in #PANDA, whenever we apply a partitioning step of into based on the degrees of , we know that there is some input relation with . Therefore we can refine the corresponding filter by semijoining it with on the -th branch, i.e. by taking . Moreover, this update of filters maintains (74) and (75). (Initially, we start with filters that are identical to the corresponding input relations , which trivially satisfy both (74) and (75).)
Now that we have described the #PANDA algorithm satisfying the above properties, we explain how to use it as a blackbox to solve an FAQ query of the form (2) in time . Following the same notation as in the proof of Theorem 3.12, let be the collection of all maps such that for some ; in other words, selects one bag out of each tree decomposition . Let be the collection of images of all , i.e.
[TABLE]
For each , we use #PANDA to solve the following rule (i.e. to produce relations and that satisfy the equivalence):
[TABLE]
The solutions collectively satisfy the following:
[TABLE]
Let and suppose . By distributing the conjunction over $\bigvee^{\scriptscriptstyle +}$, we get
[TABLE]
Using the same diagonalization argument from [9, 8], we can prove the following claim:
Claim 2.
For every , there must exist a tree decomposition such that for every , for some .
Assuming Claim 2 is correct, and thanks to (75), we can rewrite the conjunction as
[TABLE]
The right-hand side of (89) is an FAQ query. We solve it by running InsideOut over the tree decomposition . We repeat the above for every . Afterwards, because we have an exclusive OR over , we can simply sum up corresponding query results.
From (77), the total runtime is , where
[TABLE]
Finally we include the proof of Claim 2 for completeness, following the corresponding proof in [9, 8]. Consider a fixed . Assume to the contrary that for every tree decomposition , there is some bag for some such that . By definition of , for some . Therefore, for some . But this contradicts the claim that for every tree decomposition , . ∎
The following proposition shows that while can be larger than , it is not larger than and can be unboundedly smaller for classes of queries.
Proposition 3.16 (Connecting #smfw to smfw and faqw).
(a) For any FAQ query , the following holds:
[TABLE]
In particular, when has no free variables, we have
[TABLE]
(b) Furthermore, there are classes of queries for which the gap between and is unbounded, and so is the gap between and .
Proof.
First we prove part (a). The first inequality in (90) follows directly from the definitions of #smfw and smfw along with the fact that . To prove the second inequality in (90), we use the following variant of the Modularization Lemma from [8]:
Claim 3 (Variant of the Modularization Lemma [8]).
Given a hypergraph and a set , we have
[TABLE]
where ED is given by (5) and denotes the set of all modular functions . (A function is modular if .)
Proof of Claim 3.
Obviously, the LHS of (92) is lowerbounded by the RHS. Next, we prove LHS $\le$ RHS. W.L.O.G. we assume for some . Let . Define a function as follows:
[TABLE]
Obviously and . Next, we prove by proving that for every where for some , the following holds: .
The proof is by induction on . The base case when is trivial. For the inductive step, consider some where for some . Let be the maximum integer in , then by noting that , we have
[TABLE]
The first inequality above is by the induction hypothesis, and the second inequality follows from the fact that is a -polymatroid (recall Definition 3.13). Both steps rely on the fact that for some . Consequently, . Since , this proves Claim 3. ∎
Now we prove the second inequality in (90):
[TABLE]
The fact that follows from the two sides being dual linear programs. (Recall the definition of from Section 2.1.)
Now, we prove part (b) of Proposition 3.16. In [8], we constructed a class of graphs/queries where the gap between fhtw and subw is unbounded. We will re-use the same construction here and prove that the upperbound on subw that we proved in [8] is also an upperbound on #subw. The upperbound proof is going to be different from [8] though since here we can only use -polymatroid properties to prove the bound (recall Definition 3.13).
Given integers and , consider a graph which is an “-fold -cycle”: The vertex set is a disjoint union of -sets of vertices. Each set has vertices in it, i.e., . There is no edge between any two vertices within the set for every , i.e., is an independent set. The edge set of the hypergraph is the union of complete bipartite graphs :
[TABLE]
Finally consider an FAQ query that has a finite-sized input factor for every , i.e., and (recall notation from Section 2.2). Assuming has no free variables, then and .
We proved in [8] that . Next we prove that . Let be any function in . We recognize two cases:
- Case 1: for some . WLOG assume . Consider the TD with bags $I_{1}\cup I_{2}\cup I_{3}$, $I_{1}\cup I_{3}\cup I_{4}$, $\ldots$, $I_{1}\cup I_{2k-1}\cup I_{2k}$.
For bag , using -polymatroid properties (Definition 3.13), we have
[TABLE]
- Case 2: for all . Consider the TD with two bags: $I_{1}\cup I_{2}\cup\cdots\cup I_{k+1}$ and $I_{k+1}\cup I_{k+2}\cup\cdots\cup I_{2k}\cup I_{1}$.
For convenience, given any vertex , define the vertex set as follows:
[TABLE]
From -polymatroid properties, we have
[TABLE]
In a symmetric way, we can also show that . By setting , we prove that . Since , this proves part (b). ∎
Example 3.17.
Consider again the count query in (60), which we showed earlier how to compute in time . Since has no free variables, and . In the proof of Proposition 3.16, we show that . Therefore, the #PANDA algorithm from the proof of Theorem 3.15 can compute (60) in time . In fact, the algorithm we described earlier for (60) is just a specialization of #PANDA. The proof of Proposition 3.16 offers a family of similar examples.∎
3.3.2 FAQ-AI over an arbitrary semiring
Finally, we put everything together to solve the FAQ-AI problem. The only (very natural) change is to replace the tree decompositions by their relaxed version, and the technical details flow through.
Definition 3.18.
Given an FAQ-AI query (3) whose hypergraph is , its relaxed #-submodular FAQ-width, denoted by , is defined by
[TABLE]
When , we define .
Theorem 3.19.
Any FAQ-AI query of the form (3) on any semiring can be computed in time .
The proof of the above theorem is very similar to that of Theorem 3.15. The key difference is that instead of running InsideOut on individual FAQ queries obtained after applying #PANDA, we now run the InsideOut variant from Theorem 4. The proof is thus omitted.
Example 3.20.
Consider the following count query (which is similar to the counting version of query from Example 1.2):
[TABLE]
Let . For the above query . Any of the previously known algorithms, including the one from Theorem 4 and the one from Theorem 3.15, would need time to compute . We show below that . As an example of Theorem 3.19, we also show how to compute the above query in . (Using the same method, we can also solve the counting version of from Example 1.2 in the same time.)
First we prove that for the above query, . Here . We will use two relaxed tree decompositions in : The first has two bags and . The second has two bags and . (Both are relaxed TDs because the ligament edge is not contained in any bag; recall Definition 3.3.) Following (93), for each , we will pick one TD or the other. In particular, given some :
- If , then . We pick . From -polymatroid properties (Definition 3.13), we have
[TABLE]
- If , we pick .
[TABLE]
This proves that .
Finally, as a special case of #PANDA, we explain how to solve the above query in time (where recall ). Let
[TABLE]
Now we can write
[TABLE]
Both and above have sizes . Using the algorithm from the proof of Theorem 4, can be answered in time using the relaxed TD , while can be answered in the same time using . ∎
4 Applications to relational Machine Learning
Our FAQ-AI formalism and solution are directly applicable to learning a class of machine learning models, which includes supervised models (e.g., robust regression, SVM classification), and unsupervised models (e.g., clustering via -means). In this section, we show that the core computation of these optimization problems can be formulated in FAQ-AI over the sum-product semiring.
4.1 Training ML models over databases
A typical machine learning model is learned over a training dataset . We consider the common scenario where the input data is a relational database , and the training dataset is the result of a feature extraction join query over [38, 2, 3, 29, 22]. Each tuple consists of a vector of features of length and a label . We consider that the feature extraction query has the hypergraph , where is the set of its skeleton hyperedges.
A supervised machine learning model is a function with parameters that is used to predict the label for unlabeled data. The parameters are obtained by minimizing the objective function:
[TABLE]
where is a loss function, is a regularizer, e.g., or norm, and the constant controls the influence of regularization.
Previous work has shown that for polynomial loss functions, such as square loss , the core computation for optimizing the objective amounts to FAQ evaluation [2]. In many instances, however, the loss function is non-polynomial, either due to the structure of the loss, or the presence of non-polynomial components embedded within the model structure (e.g., ReLU activation function in neural nets) [33].
Examples of commonly used non-polynomial loss functions are: (1) hinge loss, used to learn classification models like linear support vector machines (SVM) [33], or generalized low rank models (glrm) with boolean principal component analysis (PCA) [42]; (2) Huber loss, used to learn regression models that are robust to outliers [33]; (3) scalene loss, used to learn quantile regression models [42]; (4) epsilon insensitive loss, used to learn SVM regression models [33]; and (5) ordinal hinge loss, used to learn ordinal regression models or ordinal PCA (another glrm) [42].
Any optimization problem with the above non-polynomial loss functions can benefit from our evaluation algorithm for FAQ-AI by reformulating computations in the optimization algorithm as FAQ-AI expressions over the feature extraction join query . We next exemplify this reformulation for the following problems:
- Learning a robust linear regression model using Huber loss, which can be solved with gradient-descent optimization.
- Learning a linear regression model using the scalene, epsilon insensitive, and ordinal hinge loss functions.
- Learning a linear support vector machine (SVM) for binary classification using hinge loss, which can be solved with subgradient-based optimization algorithms or with a cutting-plane algorithm for the primal formulation of linear SVM classification.
- We also consider -means unsupervised clustering and give an FAQ-AI reformulation of the computation done in an iteration of the algorithm over the dataset .
The advantage of FAQ-AI reformulation is that the FAQ-AI expressions for the aforementioned optimization problems can be evaluated over relaxed tree decompositions of the feature extraction query and do not require the explicit materialization of its result . The size of and time to compute is [35]. The solution to these optimization problems can be computed in time sub-linear in the size of , using InsideOut or #PANDA.
4.2 Background: Gradient-based Optimization
In this section, we give an overview of gradient-based optimization algorithms for convex and differentiable objective functions of the form (95). A gradient-based optimization algorithm employs the first-order gradient information to optimize . It repeatedly updates the parameters by some step size in the direction of the gradient $\nabla J(\bm{\beta})$ until convergence. To guarantee convergence, it is common to use backtracking line search to ensure that the step size is sufficiently small to decrease the loss for each step. Each update step requires two computations: (1) Point evaluation: Given , compute the scalar ; and (2) Gradient computation: Given , compute the vector $\nabla J(\bm{\theta})$.
There exist several variants of gradient descent algorithms, e.g., batch gradient descent or stochastic gradient descent, as well as many different algorithms to choose a valid step size [33]. For this work, we consider the batch gradient descent (BGD) algorithm with the Armijo backtracking line search condition, as depicted in Algorithm 1. A common choice for setting the step size is a function that is inversely related to the number of iterations of the algorithm, for instance at iteration , where is the regularization parameter from (95) [40].
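For readers who want a concrete picture of the two computations per update step, the following is a minimal sketch of batch gradient descent with an Armijo backtracking line search. It is a generic textbook variant under our own assumptions, not a transcription of Algorithm 1: the function name, the constants `alpha0`, `c`, and `shrink`, and the stopping rule are all illustrative. The callables `J` and `grad_J` stand for the point evaluation and the gradient computation.

```python
def batch_gradient_descent(J, grad_J, theta0, alpha0=1.0, c=1e-4,
                           shrink=0.5, tol=1e-8, max_iters=1000):
    """Minimize J starting from theta0 (a list of floats).
    J: point evaluation; grad_J: gradient computation (returns a list)."""
    theta = list(theta0)
    for _ in range(max_iters):
        g = grad_J(theta)
        g_norm_sq = sum(gi * gi for gi in g)
        if g_norm_sq <= tol:                  # gradient is (almost) zero
            break
        j_theta = J(theta)
        alpha = alpha0
        while True:
            candidate = [ti - alpha * gi for ti, gi in zip(theta, g)]
            # Armijo condition: require a sufficient decrease of the objective.
            if J(candidate) <= j_theta - c * alpha * g_norm_sq:
                break
            alpha *= shrink                   # backtrack: try a smaller step
        theta = candidate
    return theta
```

Each pass through the inner loop costs one point evaluation, so the cost of an iteration is dominated by how quickly the objective and its gradient can be computed over the training dataset.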
4.3 Robust linear regression with Huber loss
A linear regression model is a linear function with features and parameters . For a given feature vector , the model is used to estimate the (continuous) label . We learn the model parameters by minimizing the objective with the Huber loss function, which is defined as:
[TABLE]
Huber loss is equivalent to the square loss when and to the absolute loss otherwise. (Without loss of generality, we use a simplified Huber loss. The threshold between absolute and square loss is given by a constant and the absolute loss is .) In contrast to the absolute loss, Huber loss is differentiable at all points. It is also more robust to outliers than the square loss.
To learn the parameters, we use batch gradient-descent optimization, which repeatedly updates the parameters in the direction of the gradient $\nabla J(\bm{\beta})$ until convergence. We provide details on gradient-based optimization in Section 4.2. In this section, we focus on the core computation of the algorithm, which is the repeated computation of the objective and its gradient $\nabla J(\bm{\beta})$.
The gradient $\nabla J(\bm{\beta})$ is the vector of partial derivatives with respect to the parameters . (Note that the derivative of with respect to for any function of is always [math] whenever it is defined.) The objective function (with regularization) and its partial derivative with respect to are:
[TABLE]
Our observation is that we can compute and without materializing , by reformulating their data-dependent computation as a few FAQ-AI expressions. We explain the details next.
4.3.1 Reformulating the objective with Huber loss into FAQ-AI expressions
We show that the objective from (97) can be reformulated into FAQ-AI expressions of the form (3).
First, we consider the case where , i.e. the square loss term of . For ease of notation, let .
[TABLE]
Each summation over the training dataset in the final reformulation above can be expressed as one FAQ-AI query with two ligament hyperedges. For instance, the first summation over is equivalent to the following FAQ-AI expression:
[TABLE]
The absolute loss function for the case can be reformulated similarly:
[TABLE]
All of these terms can be reformulated as FAQ-AI expressions of the form (3).
Overall, the objective with Huber loss for learning robust linear regression models can be computed with FAQ-AI expressions, and without materializing the training dataset . Section 4.3.2 shows that the same holds for .
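The idea of aggregating over a join under an additive inequality without materializing the join can be illustrated on a toy case. The sketch below is not the FAQ-AI evaluation algorithm of this paper; it only shows, under simplifying assumptions, how sorting and binary search replace the enumeration of joined tuples. The relations `R(key, x1)` and `S(key, x2)`, the weights `w1` and `w2`, and the bound `c` are all hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import List, Tuple

Row = Tuple[int, int]  # (join key, feature value)


def count_join_with_inequality(R: List[Row], S: List[Row],
                               w1: int, w2: int, c: int) -> int:
    """Count pairs (r, s) with r.key == s.key and w1*r.x1 + w2*s.x2 <= c.

    The join is never materialized: per join key, the S-side terms are
    sorted once, and each R tuple is answered with one binary search.
    """
    s_terms = defaultdict(list)
    for key, x2 in S:
        s_terms[key].append(w2 * x2)
    for terms in s_terms.values():
        terms.sort()

    total = 0
    for key, x1 in R:
        terms = s_terms.get(key)
        if terms:
            # Number of S-side terms t with t <= c - w1*x1.
            total += bisect_right(terms, c - w1 * x1)
    return total


def count_brute_force(R: List[Row], S: List[Row],
                      w1: int, w2: int, c: int) -> int:
    """Reference implementation that enumerates the full join."""
    return sum(1 for k1, x1 in R for k2, x2 in S
               if k1 == k2 and w1 * x1 + w2 * x2 <= c)
```

The sorted version takes time linearithmic in the input relations, whereas the brute-force version is proportional to the size of the join. Sums of per-tuple values, rather than counts, could be handled in the same way with prefix sums over the sorted lists.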
4.3.2 Reformulating the gradient with Huber loss into FAQ-AI expressions
We rewrite the first of the three summations in from (98) as follows:
[TABLE]
The four terms can be expressed as FAQ-AI expressions of the form (3). For instance, the first part of the expression is equivalent to the following FAQ-AI query:
[TABLE]
The other two summations in both aggregate over and have one inequality that defines a ligament in . They can be expressed as FAQ-AI expressions. Overall, the gradient $\nabla J(\bm{\beta})$ can be expressed as FAQ-AI expressions.
Definition 4.1 (: The ligament extension of ).
Given an FAQ query with hypergraph , define the ligament extension of , denoted by , to be an FAQ-AI query with hypergraph whose set of skeleton edges is identical to and whose set of ligament edges contains a single ligament edge , i.e. and .
Theorem 4.2.
Let be an input database where is the largest relation in , and be a feature extraction query. For any robust linear regression model , the objective and gradient $\nabla J(\bm{\beta})$ with Huber loss can be computed in time with #PANDA and in time with InsideOut, where is the ligament extension of (Def. 4.1).
Proof.
Let be the number of variables in . We show in Sections 4.3.1 and 4.3.2 that we can rewrite the objective and the gradient $\nabla J(\bm{\beta})$ into FAQ-AI expressions with at most ligament hyperedges. The overall runtime bound for computing and $\nabla J(\bm{\beta})$ with #PANDA follows from Theorem 3.19, which states that #PANDA can compute each FAQ-AI expression in time .
The overall runtime bound for computing and $\nabla J(\bm{\beta})$ with InsideOut follows from Theorem 4, which states that InsideOut can compute each FAQ-AI expression in time . ∎
4.4 Further non-polynomial loss functions
In this section, we overview the following non-polynomial loss functions: (1) epsilon insensitive loss; (2) ordinal hinge loss; and (3) scalene loss. For each function, we define the loss function , the corresponding objective function , and the partial (sub)derivative which is used in (sub)gradient-based optimization algorithms. (Recall notation from Section 4.1.) In the derivations for the objective , we will focus on the loss function and ignore the regularizer for better readability.
As in the previous section, the objective and (sub)derivative can be reformulated into several FAQ-AI expressions of the form (3). Instead of writing out the expressions explicitly, we annotate those terms that can be reformulated. The actual reformulation should be clear from the examples in the previous sections.
Epsilon insensitive loss
The epsilon insensitive loss function [33] is defined as:
[TABLE]
This loss function is used to learn SVM regression models. We consider a linear regression model . The objective function and the corresponding partial subderivative with respect to are given by:
[TABLE]
The objective can thus be reformulated into FAQ-AI queries, while the gradient can be reformulated into queries: one for each for .
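The following sketch spells out the standard epsilon insensitive loss, max(0, |z| - eps) for a residual z, and a subgradient of its sum over a dataset for a linear model. It iterates over a materialized list of `(x, y)` pairs purely for illustration; the function names and the data layout are assumptions, and the regularizer is ignored, as in the derivations above.

```python
from typing import List, Tuple

Example = Tuple[List[float], float]  # (feature vector x, label y)


def eps_insensitive_loss(residual: float, eps: float) -> float:
    """Zero inside the tube [-eps, eps], absolute loss shifted by eps outside."""
    return max(0.0, abs(residual) - eps)


def eps_insensitive_subgradient(beta: List[float],
                                data: List[Example],
                                eps: float) -> List[float]:
    """A subgradient of sum_i eps_insensitive_loss(<beta, x_i> - y_i, eps)."""
    grad = [0.0] * len(beta)
    for x, y in data:
        z = sum(b * xi for b, xi in zip(beta, x)) - y
        if z > eps:          # inequality <beta, x> - y > eps
            sign = 1.0
        elif z < -eps:       # inequality <beta, x> - y < -eps
            sign = -1.0
        else:
            continue         # inside the tube: contributes nothing
        for j, xj in enumerate(x):
            grad[j] += sign * xj
    return grad
```

The two branches correspond to the two additive inequalities over the features and the label that select which tuples contribute to each sum.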
Ordinal hinge loss
The ordinal hinge loss [42] is defined as:
[TABLE]
The loss function is used to learn ordinal regression models or ordinal PCA [42]. A linear ordinal regression model is the linear function which predicts an ordinal label . The objective function and the partial subderivative with respect to are given by:
[TABLE]
The objective and partial subderivative can thus be reformulated as FAQ-AI expressions.
Scalene loss
The scalene loss function [42] is defined as:
[TABLE]
where is a constant.
The loss function is used to learn quantile regression models. We again consider a linear regression model . The objective function and the partial subderivative with respect to are given by:
[TABLE]
The objective and partial subderivative can thus be reformulated as FAQ-AI expressions.
Overall, we can reformulate the (sub)gradients under each one of the loss functions discussed in this section as FAQ-AI queries that are ligament extensions of the feature extraction query as per Def. 4.1.
4.5 Linear support vector machines
A linear SVM classification model is used for binary classification problems where the label . For the features , the model learns the parameters of a linear discriminant function such that separates the data points in into positive and negative classes with a maximum margin. The parameters can be learned by minimizing the objective function (95) with the hinge loss function:
[TABLE]
Hinge loss is non-differentiable, and thus standard gradient descent optimization is not applicable. We next discuss two alternative approaches for solving this optimization.
The first approach is based on the observation that the loss function is convex, and the objective admits subgradient vectors, which generalize the standard notion of gradient. The optimization problem can be solved with subgradient-based updates. Pegasos is a well-known algorithm for this approach [40].
The alternative approach is to solve the primal formulation of the problem, which avoids the non-differentiable objective by turning it into a constrained optimization problem with slack variables. Joachims proposed a cutting-plane algorithm which solves this optimization problem efficiently [25].
For both approaches, the number of iterations of the optimization algorithm is independent of the size of training dataset [40, 25]. Since each iteration takes time and the number of iterations is , it follows that the overall time complexity is .
Despite the fact that the two approaches solve the same problem, they have been hugely influential in their own right. We therefore consider both approaches, and show that by reformulating their computation as FAQ-AI we can solve them asymptotically faster than materializing the training dataset , i.e., sublinear in .
4.5.1 Background on Subgradient Descent
If the objective function is convex but not differentiable, the gradient $\nabla J(\bm{\beta})$ is not defined. Such objective functions do, however, admit a subgradient, which can be used in subgradient-based optimization algorithms. Algorithm 1 naturally captures the batch subgradient-descent algorithm, if the parameters are updated using a subgradient in place of the gradient.
A popular application for subgradient-descent optimization algorithms is the learning of linear SVM models. One such algorithm is the Pegasos algorithm [40], which showed that subgradient methods can learn the parameters of the model significantly faster than other approaches, including Joachims’ cutting plane algorithm [25].
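A minimal sketch of batch subgradient descent for a linear SVM is given below. It assumes the common objective (lam/2)·||beta||^2 plus the average hinge loss, and a Pegasos-style step size 1/(lam·t); the exact form of the objective (95), the names `lam` and `iterations`, and the in-memory data layout are assumptions made for illustration. It is not the Pegasos algorithm itself, which samples examples at each step.

```python
from typing import List, Tuple

Example = Tuple[List[float], int]  # (feature vector x, label y in {-1, +1})


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def hinge_objective(beta: List[float], data: List[Example], lam: float) -> float:
    """(lam/2)*||beta||^2 plus the average hinge loss over the dataset."""
    avg_hinge = sum(max(0.0, 1.0 - y * dot(beta, x)) for x, y in data) / len(data)
    return 0.5 * lam * dot(beta, beta) + avg_hinge


def hinge_subgradient(beta: List[float], data: List[Example], lam: float) -> List[float]:
    """One subgradient of hinge_objective at beta."""
    n = len(data)
    grad = [lam * b for b in beta]
    for x, y in data:
        if y * dot(beta, x) < 1.0:       # only margin violators contribute
            for j, xj in enumerate(x):
                grad[j] -= y * xj / n
    return grad


def subgradient_descent(data: List[Example], dim: int,
                        lam: float, iterations: int) -> List[float]:
    """Batch subgradient descent with step size 1 / (lam * t)."""
    beta = [0.0] * dim
    for t in range(1, iterations + 1):
        step = 1.0 / (lam * t)
        g = hinge_subgradient(beta, data, lam)
        beta = [b - step * gi for b, gi in zip(beta, g)]
    return beta
```

The data-dependent part of each step is the sum over the tuples that satisfy the inequality y·<beta, x> < 1, which is the kind of term that the reformulation in the next subsection targets.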
4.5.2 Subgradient-based optimization for linear SVM classification
We first use subgradient-based optimization to compute the parameters of the SVM model; see Section 4.5.1 for some background. The core of the optimization is the repeated computation of the objective and the partial derivatives in terms of . The objective (with regularization) and the partial derivative are:
[TABLE]
Both and can be reformulated as FAQ-AI expressions and computed without materializing . We first rewrite the objective:
[TABLE]
In the above, the sum for example can be expressed as an FAQ-AI query of the form (3) as follows:
[TABLE]
can also be rewritten into two FAQ-AI expressions:
[TABLE]
Theorem 4.3.
Let be an input database where is the largest relation in , and be a feature extraction query. For any linear SVM classification model , the objective and gradient $\nabla J(\bm{\beta})$ with hinge loss can be computed in time with #PANDA and in time with InsideOut, where is the ligament extension of (Def. 4.1).
Proof.
Let be the number of variables in . We show above that and $\nabla J(\bm{\beta})$ can be rewritten into FAQ-AI expressions with a single ligament hyperedge (i.e. ). The overall runtime bound for computing and $\nabla J(\bm{\beta})$ with #PANDA follows from Theorem 3.19, which states that #PANDA can compute each FAQ-AI query in time . The runtime for computing and $\nabla J(\bm{\beta})$ with InsideOut follows from Theorem 4: This is for an FAQ-AI query . ∎
4.5.3 Cutting-plane algorithm for linear SVM classification in primal space
An alternative to learning linear SVM via subgradient-based optimization is to pose the problem as a constrained optimization problem. The equivalent formulation for minimizing the objective (101) is the primal formulation of linear SVM [33]:
[TABLE]
where are slack variables and is the regularization parameter.
The optimization problem solves for the hyperplane that classifies the data points into two classes, so that the margin between the hyperplane and the nearest data point for each class is maximized. For each , the slack variable encodes how much the point violates the margin of the hyperplane.
Joachims’ cutting-plane algorithm solves (105) in linear time over the training dataset [25]. The algorithm solves the following structural classification SVM formulation, which is equivalent to (105):
[TABLE]
This formulation has constraints, one for each possible subset , and a single slack variable that is shared by all constraints.
Algorithm 2 presents Joachims’ cutting-plane algorithm for solving (106). It iteratively constructs a set of constraints , which is a subset of all constraints in (106). In each round , it first computes the optimal value for and over the current working set . Then, it identifies the constraint that is most violated for the current , and adds this constraint to . It continues until is violated by at most . Joachims showed that Algorithm 2 finds the -approximate solution to (106) in -many iterations [25]. Hence and the number of constraints of the optimization problem are bounded by a number independent of .
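The step that identifies the most violated constraint can be sketched as follows. This follows the usual description of the cutting-plane method in [25], where the constraint is encoded by a 0/1 vector with one entry per training example, set to 1 exactly when the example violates the margin. The function name, the data layout, and the way the violation is reported are assumptions; the working-set optimization at line 5 is not shown.

```python
from typing import List, Tuple

Example = Tuple[List[float], int]  # (feature vector x, label y in {-1, +1})


def most_violated_constraint(w: List[float],
                             data: List[Example]) -> Tuple[List[int], float]:
    """Return the 0/1 coefficient vector c of the most violated constraint
    for the current w, together with its violation
    (1/n) * sum_i c_i * (1 - y_i * <w, x_i>).
    """
    n = len(data)
    c = []
    violation = 0.0
    for x, y in data:
        margin = y * sum(wj * xj for wj, xj in zip(w, x))
        ci = 1 if margin < 1.0 else 0      # example i violates the margin
        c.append(ci)
        violation += ci * (1.0 - margin) / n
    return c, violation
```

In the iterative scheme described above, this violation would be compared against the current slack value plus the tolerance to decide whether to add the constraint to the working set or to stop.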
Next, we consider the inner optimization problem at line 5. Although is small, the number of variables can still be large. This prohibits solving with quadratic programming as it can take up to [33]. Its Wolfe dual, on the other hand, is a quadratic program with only a constant number of variables that is independent of and one constraint. Let . We next present the derived Wolfe dual.
Wolfe dual for the optimization problem at line 5 of Algorithm 2
We consider the inner optimization problem at line 5 of Algorithm 2 and show how to derive the Wolfe dual (109) from the structural SVM classification formulation (106). Let . The inner optimization problem at line 5 of Algorithm 2 is of the form:
[TABLE]
The Lagrangian function of this optimization problem is:
[TABLE]
where and are Lagrange multipliers.
Since the Lagrangian is convex and continuously differentiable, we can define the Wolfe dual as the following optimization problem:
[TABLE]
The optimality condition for is . We use this equality to rewrite the above dual formulation and obtain the following optimization problem:
[TABLE]
where is the vector of constraints.
Theorem 4.4.
Let be an input database where is the largest relation in , and be a feature extraction query. A linear SVM classification model can be learned over the training dataset with Joachims’ cutting-plane algorithm in time with #PANDA and in time with InsideOut, where is the ligament extension of (Def. 4.1).
Proof.
Recall that for each iteration of Algorithm 2, we add one set to , and is associated with a coefficient vector . Our main observation is that we do not have to materialize the set , since it is completely determined by the data and the coefficient vector . Thus, instead of storing we can simply store and reformulate the data dependent term in (109) as a computation over :
[TABLE]
The vector has size . For each , we can compute the ’th component of as the summation of the following two FAQ-AI expressions, which are of form (3):
[TABLE]
and have a single ligament hyperedge (i.e. ). Theorem 3.19 states that #PANDA computes for in time . Consequently, the optimization problem at line 5 of Algorithm 2 can be computed in time . This determines the runtime of Algorithm 2.
Using InsideOut, the runtime of Algorithm 2 follows from Theorem 4: This is for . ∎
4.6 $k$-means clustering
We next consider $k$-means clustering, which is a popular unsupervised machine learning algorithm.
An unsupervised machine learning model is computed over a dataset , for which each tuple is a vector of features without a label. A clustering task divides into clusters of “similar” data points with respect to the norm: , where is a given fixed positive integer. Each cluster is represented by a cluster mean . One of the most ubiquitous clustering methods, Lloyd’s $k$-means clustering algorithm (also known as the $k$-means method), involves the optimization problem (1) with respect to the partition and the means . Other norms or distance measures can be used, e.g., if we replace with -norm, then we get the $k$-median problem. The subsequent development considers the -norm.
Lloyd’s algorithm can be viewed as a special instantiation of the Expectation-Maximization (EM) algorithm. It iteratively computes two updating steps until convergence. First, it updates the cluster assignments for each :
[TABLE]
and then it updates the corresponding -means :
[TABLE]
Our observation is that we can reformulate both update steps (110) and (111) as FAQ-AI expressions, without explicitly computing the partitioning . For a given set of -means , let be the following function:
[TABLE]
where is the ’th component of mean vector . A data point is closest to center if and only if holds . We use this inequality to reformulate the mean vector as FAQ-AI expressions. First, we express as:
[TABLE]
Then, for each , the sum can be reformulated in FAQ-AI as follows (similarly to (4)):
[TABLE]
Each component equals the division of by .
Overall, the mean vector can be computed with FAQ-AI expressions of the form (3).
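One iteration of Lloyd's algorithm can be sketched with the closest-center test written as a comparison of additive functions. The sketch assumes that the function used in the inequality is the squared Euclidean distance to a mean with the term ||x||^2 dropped, i.e. a sum over the features of mu_l^2 - 2·x_l·mu_l; this term is the same for every mean and so does not change the comparison. The names `additive_score`, `assign`, and `lloyd_iteration` are introduced here, and the sketch scans the data points explicitly rather than evaluating FAQ-AI expressions.

```python
from typing import List

Point = List[float]


def additive_score(x: Point, mu: Point) -> float:
    """Sum over features of mu_l^2 - 2*x_l*mu_l; equals ||x - mu||^2 - ||x||^2."""
    return sum(m * m - 2.0 * xl * m for xl, m in zip(x, mu))


def assign(x: Point, means: List[Point]) -> int:
    """Index j of a closest mean: additive_score(x, mu_j) <= additive_score(x, mu_i) for all i."""
    return min(range(len(means)), key=lambda j: additive_score(x, means[j]))


def lloyd_iteration(points: List[Point], means: List[Point]) -> List[Point]:
    """One assignment step followed by one mean-update step."""
    k, d = len(means), len(means[0])
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    for x in points:
        j = assign(x, means)
        counts[j] += 1
        for l, xl in enumerate(x):
            sums[j][l] += xl
    # Each new mean component is a sum divided by a count;
    # an empty cluster keeps its previous mean.
    return [[s / counts[j] for s in sums[j]] if counts[j] else list(means[j])
            for j in range(k)]
```

Because the score is a sum of per-feature terms, the condition that a point is closest to a given center is a conjunction of additive inequalities over the features, one per competing center.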
Theorem 4.5.
Let be an input database where is the largest relation in , and be a feature extraction query where is the number of its variables. Each iteration of Lloyd’s -means algorithm can be computed in time with #PANDA and in time with InsideOut, where is the ligament extension of (Def. 4.1).
Proof.
We have shown above that each mean vector can be computed with FAQ-AI expressions of the form (3), where each query has ligament hyperedges. For #PANDA, the overall runtime to update all $k$-means follows from Theorem 3.19, which states that the algorithm can compute each FAQ-AI expression of form (3) in time . Using InsideOut, the runtime follows from Theorem 4: Any FAQ-AI query of form (3) can be computed in time . ∎
5 Conclusion
We presented a theoretical and algorithmic framework for solving a special class of functional aggregate queries that arise naturally within many in-database machine learning problems and captures a variety of database queries including inequality joins. In this query class, called FAQ-AI, some of the input factors happen to be additive inequalities over some input variables. We showed that FAQ-AI queries can be solved more efficiently than general FAQ queries by relaxing the notion of tree decompositions leading to relaxed versions of commonly used width parameters.
While FAQ queries over the Boolean semiring are solvable within the tighter bound of submodular width [32, 9], such a bound is not known to be achievable over arbitrary semirings, including count queries. Therefore, we first introduced a counting analog of the submodular width, denoted #subw, by relaxing the notion of polymatroids, and showed how to meet this bound for FAQ queries over any semiring. We then turned our attention back to the special case of FAQ-AI and showed how to strengthen the bound further in this case.
We showed how to use our framework to solve several common machine learning problems over relational data asymptotically faster than both out-of-database and previously known in-database machine learning solutions. These problems include $k$-means clustering, support vector machines, and regression over a variety of non-polynomial loss functions.
One interesting open problem is to prove a hardness result for count queries with unbounded #subw. On the one hand, this would show the tightness of our positive result for solving FAQ queries over arbitrary semirings within the #subw bound. On the other hand, this would mirror the previously known dichotomy result for query classes over the Boolean semiring based on the submodular width [32].
Another remaining problem is to measure the gap between the submodular width and its counting version #subw. More precisely, is there a class of queries where the submodular width is unboundedly smaller than #subw?
Marx [32] showed a class of queries where the submodular width is bounded while the fractional hypertree width is unbounded. Proposition 3.16 showed a class of queries where the gap between #subw and the fractional hypertree width is unbounded (but #subw is also unbounded). It remains open to show whether there exists a query class where #subw is bounded and the fractional hypertree width is unbounded.
While the FAQ-AI framework can be used to optimize machine learning problems over several non-polynomial loss functions including those presented in Sections 4.3 and 4.4, other classes of loss functions are not representable as FAQ-AI queries and do not benefit from this framework yet. These classes include for example the logistic and exponential losses commonly used for classification problems. It would be interesting to see if such loss functions could eventually be optimized in the same way in the in-database machine learning setting.
Acknowledgments
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 682588. LN gratefully acknowledges support from NSF grants CAREER DMS-1351362 and CNS-1409303, Adobe Research and Toyota Research, and a Margaret and Herman Sokol Faculty Award. BM was supported in part by a Google Research Award, and NSF grants CCF-1830711, CCF-1824303, and CCF-1733873.
Appendix A Recovering Two Existing Results
In this section we review two prior results concerned with the evaluation of queries with inequalities: the evaluation of Core XPath queries over XML documents via relational encoding in the pre/post plane and the exact inference for IQ queries with inequality joins over probabilistic databases. Our main observation is that their linearithmic complexity is due to the same structural property behind relaxed tree decompositions: Such queries admit trivially a relaxed tree decomposition, where each bag corresponds to one relation in the query and the ligament edges, i.e., the inequality joins, are covered by neighboring bags.
A.1 Core XPath Queries
We consider the problem of evaluating Core XPath queries over XML documents. An XML document is represented as a rooted tree whose nodes follow the document order. Core XPath queries define traversals of such trees using two constructs: (1) a context node that is the starting point of the traversal; and (2) a tree of location steps with one distinguished branch that selects nodes and all other branches conditioning this selection. Given a context node , a location step selects a set of the nodes in the tree that are accessible from via the step’s axis. This set of nodes provides the context nodes for the next step, which is evaluated for each such node in turn. The result of the location step is the set of nodes accessible from any of its input context nodes, sorted in document order.
The preorder rank of a node is the index of in the list of all nodes in the tree that are visited in the (depth-first, left-to-right) preorder traversal of the tree; this order is the document order. Similarly, the postorder rank of is its index in the list of all nodes in the tree that are visited in the (depth-first, left-to-right) postorder tree traversal. We can use the pre/post-order ranks of nodes to define the main axes descendant, ancestor, following, and preceding [20]. Given two nodes and in the tree, the four axes are defined using the pre/post two-dimensional plane:
- is a descendant of or equivalently is an ancestor of
[TABLE]
- follows or equivalently precedes
[TABLE]
The remaining axes parent, child, following-sibling, and preceding-sibling are restrictions of the four main axes, where we also use the parent information for each node:
- is a child of or equivalently is a parent of
[TABLE]
- is a following sibling of or equivalently is a preceding sibling of
[TABLE]
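The pre/post plane can be made concrete with a short sketch that computes both ranks for a small tree and tests two of the main axes. It uses the usual characterization from the pre/post encoding: v is a descendant of u when pre(u) < pre(v) and post(v) < post(u), and v follows u when pre(u) < pre(v) and post(u) < post(v). The representation of the tree as a dictionary of child lists and all function names are assumptions made for illustration.

```python
from typing import Dict, Hashable, List, Tuple

Tree = Dict[Hashable, List[Hashable]]  # node -> ordered list of children


def pre_post_ranks(children: Tree, root: Hashable) -> Tuple[dict, dict]:
    """Preorder and postorder ranks (both starting at 0) of every node."""
    pre = {root: 0}
    post = {}
    next_pre, next_post = 1, 0
    stack = [(root, iter(children.get(root, ())))]
    while stack:
        node, remaining = stack[-1]
        child = next(remaining, None)
        if child is None:                 # all children done: assign post rank
            post[node] = next_post
            next_post += 1
            stack.pop()
        else:                             # first visit of child: assign pre rank
            pre[child] = next_pre
            next_pre += 1
            stack.append((child, iter(children.get(child, ()))))
    return pre, post


def is_descendant(v, u, pre: dict, post: dict) -> bool:
    """v is a descendant of u (equivalently, u is an ancestor of v)."""
    return pre[u] < pre[v] and post[v] < post[u]


def is_following(v, u, pre: dict, post: dict) -> bool:
    """v follows u in document order without being a descendant of u."""
    return pre[u] < pre[v] and post[u] < post[v]
```

Each axis test is a conjunction of two inequalities over rank values, which is what makes the relational encoding of a location step an inequality join.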
We follow the standard approach to reformulate XPath evaluation in the relational domain [20]. We represent the document by a factor in the Boolean semiring with schema . For each node in the tree there is one tuple in with and ranks, label , and preorder rank of the parent node. A query with location steps is mapped to an FAQ-AI expression that is a join of copies of where the join conditions are the inequalities encoding the axes of the steps. The first copy is for the initial context node(s). The axis of the -th step is translated into the conjunction of inequalities between pre/post rank variables of the copies and . The query has one free variable: This is the preorder rank variable from the copy of corresponding to the location step that selects the result nodes.
Example A.1.
The Core XPath query
[TABLE]
selects all -labeled nodes following -labeled nodes that are descendants of the given context node and that have at least one -labeled descendant node. The steps in the above textual representation of the query are separated by /. The brackets [ ] delimit a condition on the selection of the -labeled nodes. We can reformulate this query in FAQ-AI over the Boolean semiring as follows:
[TABLE]
The hypergraph of a relational encoding of a Core XPath query has one skeleton hyperedge for each copy of the document factor and one ligament edge for each pair of inequalities over two of these copies. Any two skeleton hyperedges may only have one node, i.e., query variable, in common to express the parent/child or sibling relationship between their corresponding steps. This hypergraph admits a trivial relaxed tree decomposition, which mirrors the tree structure of the query. In particular, there is one bag of the decomposition consisting of the variables of each copy of the document factor. Each ligament edge represents a pair of inequalities over variables of two neighboring bags. The running intersection property holds since the equalities are by construction only over variables from neighboring bags.
It is known that the time complexity of answering a Core XPath query with location steps over an XML document is (Theorem 8.5 [18]; it assumes the document factor is sorted). We can show a linearithmic time complexity result using our FAQ-AI reformulation of Core XPath queries and the trivial relaxed tree decomposition.
Proposition A.2.
For any Core XPath query with location steps and XML document , the query answer can be computed in time .
Proof.
Let be the FAQ-AI reformulation of and the factor representing the XML document . There is a one-to-one correspondence between the trivial relaxed tree decomposition and the XPath query, with one bag per location step. Let be the number of location steps in , or equivalently the number of bags in the tree decomposition. We consider this trivial tree decomposition and choose its root as the bag corresponding to the location step that selects the answer node set. Our evaluation algorithm proceeds in a bottom-up left-to-right traversal of the tree decomposition and eliminates one bag at a time.
We index the bags and their corresponding factors in this traversal order. The first factor to eliminate is then denoted by while the last factor, which corresponds to the location step selecting the answer node set, is denoted by .
We initially create factors that are copies of factors corresponding to leaf bags in the tree. Consider now two factors and corresponding to a leaf bag and respectively to its parent bag. Let be the conjunction of inequalities defining the axis relationship between the location steps corresponding to these bags. We then compute a new factor that consists of those tuples in that join with some tuples in . This is expressed in FAQ-AI over the Boolean semiring:
[TABLE]
The conjunction only has two inequalities on variables between the two bags. Computing takes time following the algorithm from the proof of Theorem 4. We can sort both and in ascending order on the preorder column and in descending order on the postorder column. For each tuple in , the tuples in that join with form a contiguous range in . To assert whether is in , it suffices to check that this range is not empty. There are such steps and , with an overall time complexity of . ∎
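A single elimination step of this kind can be sketched for the descendant axis as follows. The sketch is a variant of the argument in the proof rather than a transcription of it: instead of locating a contiguous range, it sorts the child-side tuples by preorder rank, precomputes suffix minima of the postorder rank, and answers each parent-side tuple with one binary search. Tuples are assumed to be `(pre, post)` pairs, and the function name is hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple

Node = Tuple[int, int]  # (preorder rank, postorder rank)


def semijoin_descendant(parents: List[Node], candidates: List[Node]) -> List[Node]:
    """Keep each parent tuple u for which some candidate v satisfies
    pre(v) > pre(u) and post(v) < post(u), i.e. v is a descendant of u.
    Runs in O((|parents| + |candidates|) log |candidates|) time.
    """
    cand = sorted(candidates)                      # ascending preorder rank
    pres = [p for p, _ in cand]
    # suffix_min[i] = smallest postorder rank among cand[i:].
    suffix_min = [float("inf")] * (len(cand) + 1)
    for i in range(len(cand) - 1, -1, -1):
        suffix_min[i] = min(cand[i][1], suffix_min[i + 1])

    result = []
    for pre_u, post_u in parents:
        i = bisect_right(pres, pre_u)              # first candidate with pre > pre_u
        if suffix_min[i] < post_u:                 # some such candidate has post < post_u
            result.append((pre_u, post_u))
    return result
```

The output is the new factor for the parent bag: the subset of its tuples that join with at least one tuple of the eliminated leaf bag.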
A.2 Probabilistic Queries with Inequalities
The problem of query evaluation in probabilistic databases is #P-hard for general queries and probabilistic database formalisms [41]. Extensive prior work focused on charting the tractability frontier of this problem, with positive results for several classes of queries on so-called tuple-independent probabilistic databases. We discuss here one such class of queries with inequality joins called IQ [36].
A tuple-independent probabilistic database is a database where each tuple is associated with a Boolean random variable that is independent of the other tuples in the database. This is the database formalism of choice for studies on query tractability since inference is hard already for trivial queries on more expressive probabilistic database formalisms [41].
FAQ factors naturally capture tuple-independent probabilistic databases: A tuple-independent probabilistic relation is a factor that maps each tuple in to the probability that the associated random variable is true.
We next define the class IQ of inequality queries and later show how to recover the linearithmic time complexity for their inference.
Definition A.3 (adapted from Definitions 3.1, 3.2 [36]).
Let a hypergraph , where and are disjoint, consists of pairwise disjoint sets, consists of sets for which there is a vector , and . An IQ query has the form
[TABLE]
where are distinct factors. ∎
The edges (i.e., binary hyperedges) in correspond to inequalities of the query variables. These inequalities are restricted so that there is at most one node (query variable) from any hyperedge in . Inequalities on variables of the same factor are not in ; they can be computed trivially in a pre-processing step.
The inequalities may only have the form or . They induce an inequality graph where is a parent of if . This graph can be minimized by removing edges corresponding to redundant inequalities implied by other inequalities [23]. Each graph node thus corresponds to precisely one factor. We categorize the IQ queries based on the structural complexity of their inequality graphs into (forests of) paths, trees, and graphs.
Example A.4.
Consider the following IQ queries:
[TABLE]
The inequalities form a path in and a tree in .
The probability of a query over a probabilistic database is the probability of its lineage [41]. The lineage is a propositional formula over the random variables associated with the input tuples. It is equivalent to the disjunction of all possible derivations of the query answer from the input tuples.
Example A.5.
Consider the factors , , , where , , denote the variables associated with the tuples in these factors and for a random variable , denotes the probability that :
[TABLE]
The lineage of and over these factors is:
[TABLE]
[TABLE]
Prior work (Theorem 4.7 [36]) showed that the probability of an IQ query with an inequality tree with nodes over a tuple-independent probabilistic database of size can be computed in time using a construction of the query lineage in an Ordered Binary Decision Diagram (OBDD). We show next that a variant of the algorithm in the proof of Lemma 3.1, adapted from counting to weighted counting, i.e., probability computation, can compute the probability in time , thus shaving off an exponential factor in the number of inequalities.
We first explain this result using two examples, which draw on a crucial observation made in prior work [36]: The lineage of IQ queries has a chain structure: For each factor, there is an order on its random variables that defines a chain of logical implications between their cofactors in the lineage: the cofactor of the first variable implies the cofactor of the second variable, which implies the cofactor of the third variable, and so on.
Example A.6.
We continue Example A.5. The lineage of and is arranged so that the chain structure becomes apparent. This structure allows for an equivalent rewriting of the lineage [36], as shown next for the lineage of (for a random variable , denotes its negation):
[TABLE]
In disjunctive normal form, the lineage of may have size cubic in the size of the database. The factorization of the lineage in Example A.5 lowers the size to quadratic. The above rewriting further reduces the size to linear. The rewritten form can be read directly from the input factors following the structure of the inequality tree.
Since the above expressions are sums of two mutually exclusive formulas, their probabilities are the sums of the probabilities of their respective two formulas. Their probabilities can be computed in one bottom-up right-to-left pass: First for in decreasing order of , then for in decreasing order of , and finally for in decreasing order of . We extend the probability function from input random variables to formulas over these variables. The probability of ’s lineage, which is also the probability of , is ():
[TABLE]
Since there are no variables , , and , we use . This computation corresponds to a decomposition of that can be captured by a linear-size OBDD [36].
The probability of the lineage of is computed similarly ():
[TABLE]
This computation would correspond to a decomposition of that can be captured by an OBDD with several nodes for a random variable from and ; in general, such an OBDD would have a size linear in but with an additional exponential factor in the size of the inequality tree due to the inability to represent succinctly the products of lineage over and of lineage over [36]. (OBDDs with AND nodes can capture such products without this exponential factor, though in this article we do not use them.)∎
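The chain structure can be exercised on the smallest case, a hypothetical query over two unary factors R(A) and S(B) with the single inequality A ≤ B; this query is an assumption made for illustration and is not one of the queries in Example A.4. With the R tuples in increasing order of A, the cofactor of each tuple implies the cofactor of the previous one, so the lineage splits into two mutually exclusive parts at each tuple and its probability can be computed in one pass after sorting. A brute-force enumeration of possible worlds is included as a check.

```python
from itertools import product
from typing import List, Tuple

Fact = Tuple[int, float]  # (value, probability that the tuple is present)


def prob_exists_leq(R: List[Fact], S: List[Fact]) -> float:
    """Probability that some present r in R and s in S satisfy r.a <= s.b,
    for independent tuples, in O(n log n) time."""
    s_sorted = sorted(S)
    j = len(s_sorted) - 1
    none_s = 1.0   # probability that no S tuple with b >= current a is present
    prob = 0.0     # probability of the lineage over the R tuples seen so far
    for a, p in sorted(R, reverse=True):            # decreasing order of a
        while j >= 0 and s_sorted[j][0] >= a:
            none_s *= 1.0 - s_sorted[j][1]
            j -= 1
        cofactor = 1.0 - none_s
        # Mutually exclusive cases: r present (then the cofactor decides),
        # or r absent (then the tuples with larger a decide).
        prob = p * cofactor + (1.0 - p) * prob
    return prob


def prob_brute_force(R: List[Fact], S: List[Fact]) -> float:
    """Reference: sum the weights of all possible worlds satisfying the query."""
    facts = [(v, p, "R") for v, p in R] + [(v, p, "S") for v, p in S]
    total = 0.0
    for world in product([False, True], repeat=len(facts)):
        weight = 1.0
        for present, (_, p, _) in zip(world, facts):
            weight *= p if present else 1.0 - p
        r_vals = [v for present, (v, _, rel) in zip(world, facts) if present and rel == "R"]
        s_vals = [v for present, (v, _, rel) in zip(world, facts) if present and rel == "S"]
        if any(a <= b for a in r_vals for b in s_vals):
            total += weight
    return total
```

The single pass in decreasing order of the value mirrors the bottom-up, right-to-left evaluation described in the example; inequality trees with more nodes require combining such passes along the tree.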
Proposition A.7.
Given a tuple-independent probabilistic database of size and an IQ query with a forest of inequality trees, we can compute the probability of over in time .
Proof.
We next present the inference algorithm for a given IQ query with an inequality tree. It uses a minor variant of the algorithm from the proof of Lemma 3.1 to compute a functional aggregate query with additive inequalities over two factors.
We first reduce the input database to a simplified database of unary and nullary factors that is constructed by aggregating away all query variables that do not contribute to inequalities.
Let us partition into the hyperedges that contain query variables involved in inequalities and all other hyperedges .
We reduce each factor with a query variable occurring in inequalities to a unary factor by aggregating away all other query variables. For an -value , gives the probability of the disjunction of the independent random variables associated with the tuples in that have the -value :
[TABLE]
We also reduce all factors with no query variable occurring in inequalities to one nullary factor by aggregating away all query variables. gives the probability of the conjunction of all factors without query variables in inequalities:
[TABLE]
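The reduction to unary and nullary factors can be illustrated with a minimal sketch. It assumes each factor is a list of `(tuple, probability)` pairs over independent Boolean random variables, and that the factors without inequality variables are mutually independent and each satisfied when at least one of its tuples is present. The names `unary_factor`, `nullary_factor`, `R`, `phi_R`, and `phi_0` are hypothetical and do not come from the paper.

```python
from collections import defaultdict
from math import prod

def unary_factor(tuples, var_index):
    """Map each value x of the inequality variable to the probability that at
    least one tuple with that value is present: 1 - prod(1 - p)."""
    absent = defaultdict(lambda: 1.0)
    for values, p in tuples:
        absent[values[var_index]] *= 1.0 - p
    return {x: 1.0 - q for x, q in absent.items()}

def nullary_factor(relations):
    """Probability that every relation is non-empty (assumed reading of the
    conjunction of factors without inequality variables)."""
    return prod(1.0 - prod(1.0 - p for _, p in rel) for rel in relations)

# Hypothetical data: the inequality variable sits at position 0 of each tuple.
R = [((1, "a"), 0.5), ((1, "b"), 0.5), ((2, "a"), 0.2)]
phi_R = unary_factor(R, var_index=0)
phi_0 = nullary_factor([[(("c",), 0.5), (("d",), 0.5)]])
```

Under these assumptions, the value 1 of the inequality variable receives probability 1 - 0.5 · 0.5 = 0.75, since either of its two independent tuples suffices.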
This simplification reduces the set of hyperedges to a new set of unary edges, one per query variable in the inequalities, and one nullary edge: . The simplification does not affect the inference problem: The probability of is the same as the probability of the query over :
[TABLE]
The hypergraph of trivially admits the relaxed tree decomposition whose structure is that of the inequality tree of (and of ): The skeleton edges are and the ligament edges are .
The inference algorithm traverses the inequality tree bottom-up and eliminates one level of query variables at a time. For a variable with children , it computes recursively the factor
[TABLE]
We use to find the value in that is the least upper bound of and to find the value in that is the least strict upper bound of , i.e., the next value in ascending order. The definition of is recursive: It first computes the probability for and then for its previous values. In case has no children, i.e., the product over is one.
The probability of is then the product of and the probability of the first tuple in the factor of the root variable. If has a forest of inequality trees, then the subqueries for the trees would be disconnected and thus correspond to independent random variables. The probability of is then the product of the probabilities of the independent subqueries. ∎
The case of inequality graphs can be reduced to that of inequality trees by variable elimination. The elimination of a variable repeatedly replaces it in the query by a value from its domain. The inequality graph of this residual query has no node for and none of its edges. By removing variables to obtain an inequality tree, the complexity of computing the query probability increases by at most the product of the sizes of the factors having these variables.
Appendix B Omitted Details about Tree Decompositions
Here we prove Proposition 2.4, which is re-stated below.
Proposition B.1 (Re-statement of Proposition 2.4).
For every tree decomposition of a query , there exists a non-redundant tree decomposition of that satisfies
[TABLE]
Moreover, if is -connex, then can be chosen to be -connex as well.
Proof.
Given a redundant tree decomposition , by Definition 2.3 there must exist where . We claim that and can be chosen to be adjacent in the tree . In particular, if and from Definition 2.3 are already adjacent, we are done. Otherwise, consider the node that is adjacent to on the path from to in the tree . By the running intersection property, we have . Therefore if we replace with , we obtain two new adjacent nodes and satisfying .
Now we modify the tree decomposition by removing from and connecting all the neighbors of (other than ) directly to . It is straightforward to verify that this modification results in a valid tree decomposition . Moreover this modification maintains the -connex property of the original tree decomposition, if it was -connex in the first place. If the new tree decomposition is non-redundant, we are done. Otherwise, we inductively repeat the above process by finding a new adjacent pair satisfying . (This induction is over the number of bags in the tree decomposition since each time we are dropping one bag.) ∎
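The contraction argument in this proof can be sketched in code. The sketch assumes a tree decomposition stored as a dictionary of bags and an adjacency dictionary; the function name `make_non_redundant` and the example data are hypothetical. It searches only adjacent pairs, which suffices by the first paragraph of the proof.

```python
def make_non_redundant(bags, adj):
    """bags: node -> set of variables; adj: node -> set of neighbouring nodes.
    Repeatedly removes a node t1 whose bag is contained in the bag of an
    adjacent node t2 and reconnects the other neighbours of t1 to t2."""
    bags = {t: set(b) for t, b in bags.items()}
    adj = {t: set(ns) for t, ns in adj.items()}
    changed = True
    while changed:
        changed = False
        for t1 in list(bags):
            t2 = next((t for t in adj[t1] if bags[t1] <= bags[t]), None)
            if t2 is None:
                continue
            for n in adj[t1] - {t2}:
                adj[n].discard(t1)
                adj[n].add(t2)
                adj[t2].add(n)
            adj[t2].discard(t1)
            del bags[t1], adj[t1]
            changed = True
            break  # restart the scan: one bag was dropped
    return bags, adj

# Hypothetical path 1 - 2 - 3 whose middle bag {y} is redundant.
bags = {1: {"x", "y"}, 2: {"y"}, 3: {"y", "z"}}
adj = {1: {2}, 2: {1, 3}, 3: {2}}
new_bags, new_adj = make_non_redundant(bags, adj)
```

Each iteration of the outer loop drops one bag, so the loop terminates after at most as many rounds as there are bags, mirroring the induction in the proof.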
Appendix C The InsideOut Algorithm
In this section, we aim to provide a proof sketch for Theorem 2.5. We refer the reader to [6] and its extended version [5] for more details. The proof also sheds light on many omitted technical details in the proofs of Theorems 4 and 3.12 including how to generalize theorems from the case of no free variables to the case of an arbitrary set of free variables.
Theorem C.1 (Re-statement of Theorem 2.5).
InsideOut answers query (2) in time .
Proof.
Let be an FAQ-query of the form (2) with hypergraph and free variables . Let . By definition of from (11), there must exist an -connex tree decomposition where all bags satisfy
[TABLE]
Moreover by Proposition 2.4, the above tree decomposition can be assumed to be non-redundant. By Definition 2.2, there must exist a (possibly empty) subset that forms a connected subtree of and satisfies . Fix a root of the tree decomposition to be:
- either an arbitrary node from if is not empty,
- or an arbitrary node from if is empty.
Based on the above choice of the root , the following holds:
Claim 4.
If , then there must exist a leaf node .
If is empty, then the above claim holds trivially. Otherwise, the above claim holds because the root belongs to the connected subtree .
We recognize two cases:
Case 1: . In this case, (since and ). By Claim 4, let be a leaf node from , and let be the parent of . Let , , and . Because the tree decomposition is non-redundant (thanks to Proposition 2.4), we have .
Claim 5.
For any with , we must have .
The above claim holds by the definition of a tree decomposition from Section 2.1: Otherwise, the running intersection property would break.
To rewrite query (2), we need to utilize the notion of indicator projection from Definition 3.6 along with its property given by (25). Query (2) can be written as:
[TABLE]
The last equality above holds because of the distributive property of semirings. We define the product inside the inner sum to be a query , which is associated with the bag . Note that by Claim 5, all factors and in this product involve only variables from .
Query can be computed with the help of worst-case optimal join algorithms [34, 35, 43]. In particular, for every where , define to be the support of the factor , i.e.
[TABLE]
can be viewed as a relation over variables . Solving the FAQ-query can be reduced to solving the join query defined as follows:
[TABLE]
This is because once we solve the join query , the FAQ-query can be computed as follows:
[TABLE]
where above denotes the output of the join query . The join query can be computed using a worst-case optimal join algorithm in time , which is by (117).
Once we have computed , we use it to compute defined as follows:
[TABLE]
The above can be computed by sorting tuples that satisfy lexicographically based on so that tuples sharing the same -prefix become consecutive. Then for each distinct -prefix, we aggregate away over all tuples sharing that prefix.
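The sort-then-aggregate step can be sketched as follows, assuming the intermediate result is stored as a dictionary from tuples to semiring values and that the variables to keep form a prefix of each tuple. The names `aggregate_suffix`, `psi`, and `prefix_len` are hypothetical; ordinary addition stands in for the semiring sum.

```python
from functools import reduce
from itertools import groupby
from operator import add

def aggregate_suffix(psi, prefix_len, plus=add):
    """Sum out all positions after prefix_len by sorting the tuples
    lexicographically and scanning runs that share the same prefix."""
    out = {}
    rows = sorted(psi.items())  # lexicographic order on the tuples
    for prefix, run in groupby(rows, key=lambda kv: kv[0][:prefix_len]):
        out[prefix] = reduce(plus, (value for _, value in run))
    return out

# Hypothetical factor over two variables; the second one is aggregated away.
psi = {(1, 5): 2, (2, 7): 1, (1, 6): 3}
result = aggregate_suffix(psi, prefix_len=1)
```

Passing a different `plus` (for instance `max` or a Boolean `or`) would adapt the sketch to another semiring.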
Finally, expression (121) can be rewritten as:
[TABLE]
The above is an FAQ-query of the same form as (2). It admits an -connex tree decomposition that results from the original -connex tree decomposition by removing the leaf bag . In particular, the newly added hyperedge (corresponding to ) is contained in , and all other properties of -connex tree decompositions continue to hold after the removal of . Moreover, thanks to the fact that , the new query (124) has strictly fewer variables than the original query (2). In particular, the new query only involves the variables while the original query involves . (We say that variables have been eliminated from the original query, hence the term “variable elimination”.) By induction on the number of variables, we can solve the original query (2) in the claimed time of . (In the base case, we have an FAQ-query with no variables, where the theorem holds trivially.)
Case 2: . Let be an arbitrary leaf node and its parent. Let and be defined as before. Claim 5 continues to hold. In this case, query (2) can be written as:
[TABLE]
Just like in the previous case, we use a worst-case optimal join algorithm to compute above in time . Once we do, we compute its indicator projection:
[TABLE]
Now (127) can be written as:
[TABLE]
Note that thanks to the indicator projection that is included in the new query above, the following holds: For every tuple that satisfies , there must exist at least one tuple that satisfies . This in turn implies that:
[TABLE]
By induction on the number of variables, we solve the new query (which contains neither the bag nor the variables ) in time , which is thanks to (131). Finally, we compute the original query using the expression
[TABLE]
In particular, the above expression can be computed in time as follows. First, we index tuples satisfying so that for a given we can enumerate in constant delay all tuples where . After that, we iterate over tuples satisfying , extract the -part out of each such -tuple, and then use the previous index of to enumerate -tuples corresponding to . ∎
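The index-then-enumerate step can be sketched with a hash index. The sketch assumes both intermediate results are lists of tuples and that the positions of the shared variables on each side are known; the names `extend`, `psi_rest`, `psi_leaf`, `shared_rest`, and `shared_leaf` are hypothetical.

```python
from collections import defaultdict

def extend(psi_rest, psi_leaf, shared_rest, shared_leaf):
    """Yield every (rest_tuple, leaf_tuple) pair that agrees on the shared
    variables. After the index is built, the matches for one rest_tuple
    are produced one by one from a precomputed list."""
    index = defaultdict(list)
    for leaf in psi_leaf:
        index[tuple(leaf[i] for i in shared_leaf)].append(leaf)
    for rest in psi_rest:
        key = tuple(rest[i] for i in shared_rest)
        for leaf in index.get(key, ()):
            yield rest, leaf

# Hypothetical data: position 1 on the left is shared with position 0 on the right.
pairs = list(extend([(1, 2), (3, 4)],
                    [(2, "a"), (2, "b"), (9, "c")],
                    shared_rest=(1,), shared_leaf=(0,)))
```

In the proof, the indicator projection guarantees that every tuple on the left has at least one match, which is what keeps the enumeration within the output size; the toy data above does not enforce that guarantee.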
Appendix D The PANDA Algorithm
In this section, we give an overview of the PANDA algorithm developed in [9] along with its extended version [8]. The aim is to fill in technical details omitted from the proof of Theorem 3.15, which introduces a variant of PANDA called #PANDA.
Following notation from Section 1.2, the input to the PANDA algorithm is as follows:
- A multi-hypergraph (see the definition in Section 1.2).
- A relation associated with each hyperedge . The arity of is .
- A disjunctive Datalog query of the form
[TABLE]
where .
The output of PANDA is a collection of tables that form a solution to the disjunctive Datalog query (133) (which can have many solutions). In particular, the tables must satisfy the following condition:
Each tuple that satisfies the conjunction must also satisfy the disjunction .
Following notation from Sections 2.1 and 2.2, the runtime of PANDA is , where
[TABLE]
(Recall that hides a polylogarithmic factor in .) We start with some preliminaries. The following lemma shows how to convert the expression (134) into a linear program.
Lemma D.1 ([9, 8]).
There exists a non-negative vector satisfying and
[TABLE]
Note that the right-hand side of (135) is a linear program: Its variables are , its objective function is , and its constraints are and , which are all linear. (Recall the definitions of and from Section 2.1 and (9) respectively.) Our next step is to reduce solving this linear program to finding a Shannon inequality, defined below.
Definition D.2 (Shannon inequality).
Given real constants (where each could be either positive, negative, or zero), the linear inequality is called a Shannon inequality if it holds for all .
Let OPT be the optimal solution to the linear program from the right-hand side of (135):
[TABLE]
The following lemma was proved in [8] using linear programming duality.
Lemma D.3 ([9, 8]).
There exists a non-negative vector satisfying the following conditions:
- The inequality
[TABLE]
is a Shannon inequality.
- [TABLE]
Shannon-flow inequalities [8] are a special class of Shannon inequalities that subsumes inequality (137). They enjoy certain properties that the PANDA algorithm relies on. Given , let denote
[TABLE]
Definition D.4 (Shannon-flow inequality [8]).
Given , let be a non-negative vector. Let be another non-negative vector. A Shannon-flow inequality is a Shannon inequality that has the following form:
[TABLE]
Note that (137) is a special case of (140) where for and otherwise.
Lemma D.5 (Proof sequence construction [9, 8]).
Every Shannon-flow inequality (140) admits a proof of the following form. Start from the right-hand side of (140) and apply a sequence of proof steps, each of which replaces one or more terms with terms that are no larger, until we end up with the left-hand side of (140) (which proves that the left-hand side is at most the right-hand side). Each proof step in the sequence has one of the following forms:
[TABLE]
Each proof step in (141)-(144) is interpreted as replacing the term(s) on the left-hand side of the step with the term(s) on the right-hand side. Note that for each step in (141)-(144) and each , the right-hand side of the step is guaranteed to be no larger than the left-hand side. For example, consider the submodularity step (143), where we replace with . Because , it must satisfy the inequality . (Recall the definition of in Section 2.1.) But this inequality can be rearranged into . Similarly, consider the monotonicity step (144), where we replace with for some . Since , it must satisfy whenever .
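As a numerical sanity check, the sketch below verifies the standard forms of monotonicity, h(A) ≤ h(B) for A ⊆ B, and submodularity, h(A ∪ B) + h(A ∩ B) ≤ h(A) + h(B), on the entropy function of a uniform distribution over a small relation. That these standard forms are the properties behind steps (143) and (144) is an assumption here, since the formulas themselves are not reproduced in this text; the relation `R` and the helper `entropy` are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2

def entropy(rel, cols):
    """Entropy in bits of the marginal on cols of the uniform distribution
    over the tuples of rel."""
    counts = Counter(tuple(t[c] for c in cols) for t in rel)
    n = len(rel)
    return -sum(k / n * log2(k / n) for k in counts.values())

# Hypothetical relation: the third column is the XOR of the first two.
R = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
subsets = [frozenset(s) for r in range(4) for s in combinations(range(3), r)]
h = {s: entropy(R, sorted(s)) for s in subsets}

eps = 1e-9  # tolerance for floating-point comparisons
monotone = all(h[a] <= h[b] + eps
               for a in subsets for b in subsets if a <= b)
submodular = all(h[a | b] + h[a & b] <= h[a] + h[b] + eps
                 for a in subsets for b in subsets)
```

Both checks succeed for every entropy function, which is why a proof sequence built from such steps can only decrease (or preserve) the value of the expression it rewrites.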
The PANDA algorithm starts from the target runtime bound where is given by (134), computes a corresponding Shannon-flow inequality (137) from Lemma D.3 (where thanks to Lemma D.1), and then uses Lemma D.5 to construct a proof sequence for this inequality consisting of proof steps . After that, the algorithm mimics the process of using this proof sequence to prove inequality (137). In particular, it starts from the right-hand side of (137), associating each entropy term with a corresponding input relation . It then applies the proof steps one by one: Each time a proof step is applied to replace some entropy terms on the right-hand side of (137) with new entropy terms, the algorithm takes the relations associated with the old terms, applies some relational operator on them to produce new relations, and associates the new relations with the new entropy terms. At the end of the proof sequence, we would have completely transformed the right-hand side of (137) into the left-hand side, completing the proof. At that time, PANDA would have computed relations associated with entropy terms on the left-hand side of (137). Those particular relations form a solution to the input disjunctive Datalog rule (133). Moreover, the algorithm ensures that every relational operator that was performed while mimicking the proof sequence took time within our target runtime bound of .
Before formally describing the invariants maintained by the algorithm, we need some notation.
Definition D.6 (Degrees in a relation).
Given a relation and a set , the degree of w.r.t. a tuple and w.r.t. are defined as follows:
[TABLE]
As a special case, we have .
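A minimal sketch of the degree notion, under the common reading that the degree of a set of columns Y with respect to a tuple t over columns X counts the distinct Y-projections of tuples agreeing with t, and that the degree with respect to X is the maximum over all such t. The function name `degree` and the column-position encoding are hypothetical.

```python
from collections import defaultdict

def degree(rel, y_cols, x_cols):
    """Maximum, over all X-tuples t occurring in rel, of the number of
    distinct Y-projections among the tuples that agree with t on x_cols."""
    groups = defaultdict(set)
    for t in rel:
        x_val = tuple(t[i] for i in x_cols)
        groups[x_val].add(tuple(t[i] for i in y_cols))
    return max((len(g) for g in groups.values()), default=0)

# Hypothetical binary relation.
R = [(1, "a"), (1, "b"), (2, "a")]
deg_given_first = degree(R, y_cols=(0, 1), x_cols=(0,))
deg_given_nothing = degree(R, y_cols=(0, 1), x_cols=())
```

With an empty conditioning set, all tuples fall into a single group, so the degree reduces to the size of the projection, which here is the size of the relation itself.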
Although the PANDA algorithm starts from a Shannon-flow inequality of the special form (137), after applying a decomposition proof step (for some ) replacing some term on the right-hand side of (137) with new terms , the resulting inequality no longer falls under the special form (137). Instead, it falls back to the more general form of a Shannon-flow inequality (140). Therefore, in general the PANDA algorithm maintains a Shannon-flow inequality (140).
The PANDA algorithm maintains the following invariants:
- (I1) Every term on the right-hand side of (140) is associated with a relation satisfying . The relation is called the guard of the term . (Note that if , then .)
- (I2) The guards satisfy the following:
[TABLE]
where is the input size and is given by (134). For convenience, we define where is the guard of .
- (I3) Every guard satisfies
[TABLE]
Initially, the above invariants are satisfied. In particular, inequality (140) at the beginning is just (137) where each is guarded by thus satisfying invariant (I1). Moreover, (147) is satisfied as follows:
[TABLE]
Also (148) is satisfied because initially each input relation satisfies . (It is straightforward to verify that defined by (134) is at least .)
Next we describe how PANDA handles each type of proof step (141)-(144) while maintaining the above invariants and also ensuring that all operations are performed in time .
Case 1: Submodularity step for some . Let be the guard of the term . We can directly use as a guard of the new term thus satisfying invariant (I1). Since both terms share the same guard, we have hence the left-hand side of (147) remains unchanged and invariant (147) remains satisfied. Invariant (148) remains satisfied as well.
Case 2: Monotonicity step for . Let be the guard of . We use as a guard of the new term . We have hence the left-hand side of (147) does not increase and invariant (147) remains satisfied. Invariant (148) remains satisfied because . Moreover since by invariant (148), the projection can be computed in our target runtime bound of .
Case 3: Decomposition step for . Let be the guard of . In this case, we partition into a small number of relations , branch the execution of the algorithm into different branches where is replaced with on the -th branch for , and continue the algorithm on each branch separately, and combine the outputs at the very end. Because of the logarithmic number of branches created at each decomposition step, the runtime of the algorithm blows up from the ideal bound of to where hides a polylogarithmic factor in .
In particular, we partition tuples into buckets based on and partition accordingly. Specifically, for each , we define
[TABLE]
After partitioning, PANDA creates independent branches of the problem, where in the -th branch, is replaced by both and . Note that for each , the size of is at most therefore:
[TABLE]
Moreover if we partition each further into two parts, we can get rid of the division by in (154) at the cost of doubling the number of branches.
Now on the -th branch, we replace the term with the two terms and , which are guarded by and respectively. By taking the logarithm of both sides of (154) (and ignoring the division by 2), we have hence the left-hand side of (147) does not increase and invariant (147) remains satisfied. Moreover, because , invariant (148) remains satisfied as well. Finally, because thanks to invariant (148), the above partitioning of can be performed in time as needed.
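The bucketing used in the decomposition step can be sketched as follows. The sketch assumes the relation is a list of distinct tuples and that bucket j collects the tuples whose X-value has degree in the range [2^j, 2^(j+1)); the exact bucket boundaries in the paper are not reproduced in this text, so this choice is an assumption. The names `partition_by_degree` and `x_cols` are hypothetical.

```python
from collections import defaultdict

def partition_by_degree(rel, x_cols):
    """Split rel into buckets keyed by j, where bucket j holds the tuples
    whose X-value t occurs d(t) times with 2**j <= d(t) < 2**(j + 1)."""
    groups = defaultdict(list)
    for t in rel:
        groups[tuple(t[i] for i in x_cols)].append(t)
    buckets = defaultdict(list)
    for tuples in groups.values():
        j = len(tuples).bit_length() - 1  # exact floor(log2(d))
        buckets[j].extend(tuples)
    return dict(buckets)

# Hypothetical relation: value 1 has degree 3, value 2 degree 1, value 3 degree 2.
R = [(1, "a"), (1, "b"), (1, "c"), (2, "a"), (3, "a"), (3, "b")]
buckets = partition_by_degree(R, x_cols=(0,))
```

In bucket j, at most |R| / 2^j distinct X-values occur and each has degree below 2^(j+1), so the product of the two quantities is at most 2|R|. This matches the factor of 2 that the text mentions, and the number of buckets is logarithmic in |R|.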
Case 4: Composition step for . Let be the guard of and be the guard of . Recall from (I1) that this implies . In this case, we compute the join by going over tuples in , projecting each one of them onto , and finding the matching tuple in (which can be done efficiently by a proper indexing of ). The output size of this join satisfies
[TABLE]
Moreover the join can be computed in time proportional to . We use the join result as a guard for the new term that results from applying the proof step. From (155), we have hence the left-hand side of (147) does not increase and invariant (147) remains satisfied.
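The join in the composition step can be sketched with a hash index on the second guard. The sketch assumes the first guard is a list of distinct tuples over X and the second a list of distinct tuples over Y, with the positions of X inside Y given by `x_cols`; the names `compose`, `r_x`, and `s_y` are hypothetical.

```python
from collections import defaultdict

def compose(r_x, s_y, x_cols):
    """Join r_x (tuples over X) with s_y (tuples over Y, where X is a subset
    of Y and x_cols are the positions of X inside each tuple of s_y)."""
    index = defaultdict(list)
    for s in s_y:
        index[tuple(s[i] for i in x_cols)].append(s)
    return [s for r in r_x for s in index.get(tuple(r), ())]

# Hypothetical guards: every X-value has at most 2 matches in s_y.
r_x = [(1,), (2,), (4,)]
s_y = [(1, "a"), (1, "b"), (2, "a"), (3, "a")]
joined = compose(r_x, s_y, x_cols=(0,))
```

Each tuple of `r_x` contributes at most as many output tuples as the maximum degree of an X-value in `s_y`, so the output size is bounded by the product of the size of the first guard and that degree (here 3 · 2 = 6, with 3 tuples actually produced).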
It remains to verify that the above step maintains invariant (148) and can also be performed in the desired time of . As it turns out, neither is true: Both the size of the new relation and the time it takes to compute it can exceed . In order to enforce these invariants, some new technical ideas are needed that are beyond the scope of this short introduction to PANDA. We refer the reader to [8] for a detailed explanation of how to handle this last case properly without violating the invariants of the algorithm.
- [1] Abo Khamis, M., Curtin, R. R., Moseley, B., Ngo, H. Q., Nguyen, X., Olteanu, D., and Schleich, M. On functional aggregate queries with additive inequalities. In PODS (2019), pp. 414–431.
- [2] Abo Khamis, M., Ngo, H. Q., Nguyen, X., Olteanu, D., and Schleich, M. In-database learning with sparse tensors. In PODS (2018), pp. 325–340.
- [3] Abo Khamis, M., Ngo, H. Q., Nguyen, X., Olteanu, D., and Schleich, M. Learning models over relational data using sparse tensors and functional dependencies. ACM Trans. Database Syst. (2020).
- [4] Abo Khamis, M., Ngo, H. Q., Olteanu, D., and Suciu, D. Boolean tensor decomposition for conjunctive queries with negation. In ICDT (2019), pp. 21:1–21:19.
- [5] Abo Khamis, M., Ngo, H. Q., and Rudra, A. FAQ: questions asked frequently. CoRR abs/1504.04044 (2015).
- [6] Abo Khamis, M., Ngo, H. Q., and Rudra, A. FAQ: questions asked frequently. In PODS (2016), pp. 13–28.
- [7] Abo Khamis, M., Ngo, H. Q., and Rudra, A. Juggling functions inside a database. SIGMOD Rec. 46, 1 (2017), 6–13.
- [8] Abo Khamis, M., Ngo, H. Q., and Suciu, D. What do Shannon-type inequalities, submodular width, and disjunctive datalog have to do with one another? CoRR abs/1612.02503 (2016).
