Learning Models over Relational Data using Sparse Tensors and Functional   Dependencies

Mahmoud Abo Khamis; Hung Q. Ngo; XuanLong Nguyen; Dan Olteanu; and Maximilian Schleich

arXiv:1703.04780·cs.DB·February 10, 2020

Learning Models over Relational Data using Sparse Tensors and Functional Dependencies

Mahmoud Abo Khamis, Hung Q. Ngo, XuanLong Nguyen, Dan Olteanu, and Maximilian Schleich

PDF

TL;DR

This paper presents a unified framework and system for efficient relational data analytics using sparse tensors and functional dependencies, enabling faster and accurate statistical model training over databases.

Contribution

It introduces a theoretical framework combining database theory and linear algebra for structure-aware learning, and implements the AC/DC system demonstrating significant performance improvements.

Findings

01

AC/DC achieves comparable accuracy to existing tools.

02

AC/DC is up to 1000 times faster in typical applications.

03

AC/DC handles large datasets without memory or timeouts.

Abstract

Integrated solutions for analytics over relational databases are of great practical importance as they avoid the costly repeated loop data scientists have to deal with on a daily basis: select features from data residing in relational databases using feature extraction queries involving joins, projections, and aggregations; export the training dataset defined by such queries; convert this dataset into the format of an external learning tool; and train the desired model using this tool. These integrated solutions are also a fertile ground of theoretically fundamental and challenging problems at the intersection of relational and statistical data models. This article introduces a unified framework for training and evaluating a class of statistical learning models over relational databases. This class includes ridge linear regression, polynomial regression, factorization machines, and…

Tables2

Table 1. Table 1: Key Characteristics of the three used datasets.

	Retailer	Favorita	Yelp
Relations	4	5	5
Variables	21	14	26
Categorical Variables	4	10	6
Tuples in Database	87M	125M	8.7M
Size of Database	1.5GB	2.5GB	0.2GB
Tuples in Join Result	86M	125M	360M
Size of Join Result	18GB	7GB	40GB
#values in Listing Representation	2.302G	1.735G	2.835G
#values in Factorized Representation	166M	372M	71.9M
Compression ( Factorized/Listing )	13.91 $\times$	4.66 $\times$	39.43 $\times$

Table 2. Table 2: Time performance (seconds) for learning LR , PR , and FaMa models over Retailer, Favorita, and Yelp. The timeout is set to 24 hours (86,400 seconds). R and MADlib do not support FaMa models. TensorFlow does not support PR and FaMa models.

Linear Regression
		Retailer v₁	Retailer v₂	Favorita v₁	Favorita v₂	Yelp
Join Computation (PSQL)		447.76	447.76	255.16	255.16	195.26
Factorized Computation of Count over Join		19.90	19.90	36.45	36.45	21.07
Features (continuous+categorical)		16 + 49	16+3,661	4 + 482	4 + 4,482	21 + 1,068
Number of Entries in Sparse Tensor		1,149	69,777	42,504	455,889	46,401
MADLib (ols)	Encode	0.19	8.46	532.52	545.42	61.79
	Learn	1,124.84	$>$ 86,400.00	13,951.44	$>$ 86,400.00	44,307.88
TensorFlow (ftrl)	Join+Shuffle+Export	2,266.76	2,266.76	1,417.87	1,417.87	1,110.41
(1 epoch, batch size 100K)	Learn	3,420.91	3,408.34	3,649.73	3,648.52	5,763.88
AC/DC	Aggregate	40.09	118.47	115.02	1,022.44	42.89
	Converge	0.11	285.92	0.94	22.20	0.14
Speedup of AC/DC over	MADlib	27.99 $\times$	$> 213.67 \times$	$124.90 \times$	$> 83.23 \times$	1,031.13 $\times$
	TensorFlow	$141.48 \times$	$14.03 \times$	$43.70 \times$	$4.85 \times$	159.76 $\times$
Polynomial Regression degree $2$
Features (continuous+categorical)		121+980	121+65,996	7+42,016	7+451,403	211+41,559
Number of Entries in Sparse Tensor		72,790	2,517,600	498,641	6,616,551	6,478,164
MADlib (ols)	Encode	0.19	8.46	532.52	545.42	61.79
	Learn	$>$ 86,400.00	$>$ 86,400.00	$>$ 86,400.00	$>$ 86,400.00	$>$ 86,400.00
AC/DC	Aggregate	122.88	334.35	324.99	7,549.99	3,650.47
	Converge	21.92	621.69	45.34	1,063.88	203.64
Speedup of AC/DC over	MADlib	$> 596.69 \times$	$> 90.38 \times$	$> 234.74 \times$	$> 10.09 \times$	$> 22.43 \times$
Factorization Machine degree $2$ rank $8$
Features (continuous+categorical)		107+980	107+65,996	5+42,016	5+451,403	192+41,559
Number of Entries in Sparse Tensor		70,515	2,465,443	497,786	6,607,696	6,454,053
libFM (MCMC)	Join+Ex/Import+Encode	3,368.06	3,368.06	3,214.79	3,214.79	2,719.48
(300 iterations)	Learn	$>$ 86,400.00	$>$ 86,400.00	$>$ 86,400.00	$>$ 86,400.00	67,829.59
AC/DC	Aggregate	124.17	324.492	351.68	7,856.88	3,633.93
(300 iterations)	Converge	0.67	42.74	2.60	25.38	265.94
Speedup of AC/DC over	libFM	$> 719.06 \times$	$> 244.45 \times$	$> 252.95 \times$	$11.37 \times$	$18.09 \times$

Equations407

Q (sku, store, day, color, quarter, city, country, unitsSold, price, size)

Q (sku, store, day, color, quarter, city, country, unitsSold, price, size)

\leftarrow Sales (sku, store, day, unitsSold), Items (sku, color, p r i ce),

Quarter (day, quarter), Stores (store, city, s i z e), Country (city, country) .

\frac{\partial ⟨ b , x ⟩}{\partial x}

\frac{\partial ⟨ b , x ⟩}{\partial x}

\frac{\partial x ^{⊤} Ax}{\partial x}

\frac{\partial ∥ Ax - b ∥ _{2}^{2}}{\partial x}

\frac{\partial u ^{⊤} v}{\partial x}

\frac{\partial ( Bx + b ) ^{⊤} C ( Dx + d )}{\partial x}

ψ (X_{1}, \dots, X_{r}, Y_{1}, \dots, Y_{s}) = ψ_{A} (X_{1}, \dots, X_{r}) ψ_{B} (Y_{1}, \dots, Y_{s}) .

ψ (X_{1}, \dots, X_{r}, Y_{1}, \dots, Y_{s}) = ψ_{A} (X_{1}, \dots, X_{r}) ψ_{B} (Y_{1}, \dots, Y_{s}) .

A ⋆ B = ∣ a_{1} ∣ ∣ a_{2} ∣ \dots \dots \dots ∣ a_{n} ∣ ⋆ ∣ b_{1} ∣ ∣ b_{2} ∣ \dots \dots \dots ∣ b_{n} ∣ = ∣ ∣ a_{1} \otimes b_{1} ∣ ∣ ∣ ∣ a_{2} \otimes b_{2} ∣ ∣ \dots \dots \dots \dots \dots ∣ ∣ a_{n} \otimes b_{n} ∣ ∣ .

A ⋆ B = ∣ a_{1} ∣ ∣ a_{2} ∣ \dots \dots \dots ∣ a_{n} ∣ ⋆ ∣ b_{1} ∣ ∣ b_{2} ∣ \dots \dots \dots ∣ b_{n} ∣ = ∣ ∣ a_{1} \otimes b_{1} ∣ ∣ ∣ ∣ a_{2} \otimes b_{2} ∣ ∣ \dots \dots \dots \dots \dots ∣ ∣ a_{n} \otimes b_{n} ∣ ∣ .

J (θ) = (x, y) \in D \sum L (⟨ g (θ), h (x) ⟩, y) + Ω (θ) .

J (θ) = (x, y) \in D \sum L (⟨ g (θ), h (x) ⟩, y) + Ω (θ) .

J (θ) = \frac{1}{2∣ D ∣} (x, y) \in D \sum (⟨ g (θ), h (x) ⟩ - y)^{2} + \frac{λ}{2} θ_{2}^{2} .

J (θ) = \frac{1}{2∣ D ∣} (x, y) \in D \sum (⟨ g (θ), h (x) ⟩ - y)^{2} + \frac{λ}{2} θ_{2}^{2} .

\frac{1}{2∣ D ∣} (x, y) \in D \sum i = 0 \sum n θ_{i} x_{i} + {i, j} \in (2 [ n ]) ℓ \in [r] \sum θ_{i}^{(ℓ)} θ_{j}^{(ℓ)} x_{i} x_{j} - y^{2} + \frac{λ}{2} θ_{2}^{2} .

\frac{1}{2∣ D ∣} (x, y) \in D \sum i = 0 \sum n θ_{i} x_{i} + {i, j} \in (2 [ n ]) ℓ \in [r] \sum θ_{i}^{(ℓ)} θ_{j}^{(ℓ)} x_{i} x_{j} - y^{2} + \frac{λ}{2} θ_{2}^{2} .

h_{S} (x)

h_{S} (x)

g_{S} (θ)

∥ θ ∥_{2} = 1 max θ^{⊤} Σ θ = θ \in R^{p} max λ \in R min θ^{⊤} Σ θ + λ (∥ θ ∥_{2}^{2} - 1),

∥ θ ∥_{2} = 1 max θ^{⊤} Σ θ = θ \in R^{p} max λ \in R min θ^{⊤} Σ θ + λ (∥ θ ∥_{2}^{2} - 1),

h_{j} (x) = f \in V ⨂ x_{f}^{\otimes a_{j} (f)}

h_{j} (x) = f \in V ⨂ x_{f}^{\otimes a_{j} (f)}

h_{j} (x) = f \in V_{j} - C_{j} \prod x_{f}^{a_{j} (f)} \cdot f \in C_{j} ⨂ x_{f} .

h_{j} (x) = f \in V_{j} - C_{j} \prod x_{f}^{a_{j} (f)} \cdot f \in C_{j} ⨂ x_{f} .

h_{1} (x)

h_{1} (x)

h_{2} (x)

h_{3} (x)

(vietnam, england, a, b, c, red, green, blue) .

(vietnam, england, a, b, c, red, green, blue) .

h_{1, vietnam} (x)

h_{1, vietnam} (x)

h_{1, england} (x)

h_{2, vietnam, red} (x)

h_{2, vietnam, green} (x)

h_{2, vietnam, blue} (x)

h_{2, england, red} (x)

h_{2, england, green} (x)

h_{2, england, blue} (x)

Σ

Σ

c

s_{Y}

J (θ)

J (θ)

\displaystyle\mbox{\boldmath$\nabla$}J(\vec{\theta})

p r i ce

p r i ce

J (θ)

J (θ)

\displaystyle\mbox{\boldmath$\nabla$}J(\vec{\theta})

\mbox{\boldmath$\nabla$}J(\vec{\theta}-\alpha\mathbf{d})=(1-\alpha)\mathbf{d}-\alpha\vec{\Sigma}\mathbf{d}.

\mbox{\boldmath$\nabla$}J(\vec{\theta}-\alpha\mathbf{d})=(1-\alpha)\mathbf{d}-\alpha\vec{\Sigma}\mathbf{d}.

α θ^{⊤} Σ d - \frac{α ^{2}}{2} d^{⊤} Σ d - α ⟨ c, d ⟩ + λ α ⟨ θ, d ⟩ \leq \frac{α}{2} (λ α + 1) ∥ d ∥_{2}^{2} .

α θ^{⊤} Σ d - \frac{α ^{2}}{2} d^{⊤} Σ d - α ⟨ c, d ⟩ + λ α ⟨ θ, d ⟩ \leq \frac{α}{2} (λ α + 1) ∥ d ∥_{2}^{2} .

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

\xpatchcmd→

[1] $\boldsymbol{$ }{#1}$

Learning Models over Relational Data using Sparse Tensors and Functional Dependencies

Mahmoud Abo Khamis

RelationalAI, Inc

Hung Q. Ngo

RelationalAI, Inc

XuanLong Nguyen

University of Michigan

Dan Olteanu

University of Oxford

Maximilian Schleich

University of Oxford

Abstract

Integrated solutions for analytics over relational databases are of great practical importance as they avoid the costly repeated loop data scientists have to deal with on a daily basis: select features from data residing in relational databases using feature extraction queries involving joins, projections, and aggregations; export the training dataset defined by such queries; convert this dataset into the format of an external learning tool; and train the desired model using this tool. These integrated solutions are also a fertile ground of theoretically fundamental and challenging problems at the intersection of relational and statistical data models.

This article introduces a unified framework for training and evaluating a class of statistical learning models over relational databases. This class includes ridge linear regression, polynomial regression, factorization machines, and principal component analysis. We show that, by synergizing key tools from database theory such as schema information, query structure, functional dependencies, recent advances in query evaluation algorithms, and from linear algebra such as tensor and matrix operations, one can formulate relational analytics problems and design efficient (query and data) structure-aware algorithms to solve them.

This theoretical development informed the design and implementation of the AC/DC system for structure-aware learning. We benchmark the performance of AC/DC against R, MADlib, libFM, and TensorFlow. For typical retail forecasting and advertisement planning applications, AC/DC can learn polynomial regression models and factorization machines with at least the same accuracy as its competitors and up to three orders of magnitude faster than its competitors whenever they do not run out of memory, exceed 24-hour timeout, or encounter internal design limitations.

1 Introduction

Although both disciplines of databases and statistics occupy foundational roles for the emerging field of data science, they are largely seen as complementary. Most fundamental contributions made by statisticians and machine learning researchers are abstracted away from the underlying infrastructure for data management. However, there is undoubtedly clear value in tight integration of statistics and database models and techniques. This is receiving an increasing interest in both academia and industry [2, 42, 61]. This is motivated by the realization that in many practical cases data used for training resides inside relational databases and bringing the analytics closer to the data saves non-trivial time usually spent on data import/export at the interface between database systems and statistical packages [38]. A complementary realization is that large chunks of statistical machine learning code can be expressed as relational queries and computed using database techniques [27, 43, 66, 24].

The problem of solving analytics over databases naturally lends itself to a systematic investigation using the toolbox of concepts and techniques developed by the database theorist, and by synergizing ideas from both relational and statistical data modeling. One can exploit database schema information, functional dependencies, state-of-the-art query evaluation algorithms, and well-understood complexity analysis.

Contributions

Our conceptual contribution is the introduction of a framework for training and evaluating a class of statistical learning models over relational databases. This class, commonly used in retail-planning and forecasting applications [11], includes ridge linear regression, polynomial regression, factorization machines, and principal component analysis. In such applications, the training dataset is the result of a feature extraction query over the database. Typical databases include weekly sales data, promotions, and product descriptions. A retailer would like to compute a parameterized model, which can predict, for instance, the additional demand generated for a given product due to promotion [50]. The feature extraction query is commonly a natural join of the database relations, yet it may join additional relations derived from the input ones using aggregation. The features correspond to database attributes, their categorical values, or aggregates over them. As is prevalent in practical machine learning, the models are trained using a first-order optimization algorithm such as batch or stochastic gradient descent, in part because their convergence rates are dimension-free (for well-behaved objectives). This is a crucial property given the high-dimensionality of our problem as elaborated next.

The main computational challenge posed by analytics over databases is the large number of records and of features in the training dataset. There are two types of features: continuous (quantitative) such as price and sales; and categorical (qualitative) such as colors, cities, and countries.111Most of the raw features we observed in datasets for retail applications are categorical. In several domains, such as statistical arbitrage [45], it is common to derive many continuous features from categorical features. While continuous features allow for aggregation over their domains, categorical features cannot be aggregated together. To accommodate the latter, the state-of-the-art approach is to one-hot encode their active domain: each value in the active domain of an attribute is encoded by an indicator vector whose dimension is the size of the domain. For instance, the colors in the domain $\{\mbox{red},\mbox{green},\mbox{blue}\}$ can be represented by indicator vectors $[1,0,0]$ for red, $[0,1,0]$ for green, and $[0,0,1]$ for blue. The one-hot encoding amounts to a relational representation of the training dataset with one new attribute per distinct category of each categorical feature and with wide tuples whose values are mostly 0. This entails huge redundancy due to the presence of the many 0 values. The one-hot encoding also blurs the usual distinction between schema and data, since the schema can become as large as the input database.

Closely related to the computational challenge is a cultural challenge: the feasibility of a tight integration of analytics and databases may be called into question. In terms of pure algorithmic performance, why would such an approach be more efficient than the common approach that decouples the computation of the training dataset from the learning task, given the widely available plethora of tools and techniques for the latter?

Our answer to these challenges is that, for a large class of feature extraction queries, it is possible to train a model in time sub-linear in the output size of the feature extraction query! This makes our approach competitive regardless of the learning techniques used by the mainstream approaches that first materialize the training dataset, including those that use sampling and stochastic gradient descent to only process a subset of the training dataset.

More concretely, our approach entails three database-centric technical contributions.

First, we exploit join dependencies and their factorization in the training dataset to asymptotically improve the per-iteration computation time of a gradient descent algorithm.

Second, we exploit functional dependencies present in the database to reduce the dimensionality of the underlying optimization problem by only optimizing for those parameters that functionally determine the others and by subsequently recovering the functionally determined parameters using their dependencies.

Third, we address the shortcomings of one-hot encoding by expressing the sum-product aggregates used to compute the gradient and point evaluation as functional aggregate queries (FAQs) [8]. The aggregates over continuous features are expressed as FAQs without free (i.e., group-by) variables and their computation yields scalar values. In contrast, aggregates over categorical features originating from a set $S$ of database attributes are expressed as FAQs with free variables $S$ . The tuples in the result of such FAQs are combinations of categorical values that occur in the training dataset. The ensemble of FAQs defining the gradient form a sparse tensor representation and computation solution with lower space and time complexity than solutions based on one-hot encoding. In particular, the complexity of our end-to-end solution can be arbitrarily smaller than that of materializing the result of the feature extraction query.

The above three technical contributions led to the design and implementation of AC/DC, a gradient descent solver for polynomial regression models and factorization machines over databases. To train such models of up to 66K features over the natural join of all relations from a real-world dataset of up to 86M tuples, AC/DC needs up to 15 minutes on eight cores of a commodity machine. AC/DC is up to 1,031 times faster than its competitors MadLib [38], libFM [64], and TensorFlow [1] whenever they do not exceed memory limitation, 24-hour timeout, or internal design limitations.

Figure 1 depicts schematically the workflows of our approach and the mainstream approach for solving optimization problems. The mainstream approach materializes the result of the feature extraction query, exports it out of the database and imports it as the training dataset in the ML tool, where the desired model is learned. We call it structure-agnostic since it does not exploit the relational structure of the underlying training dataset to avoid learning over the full materialization of the training dataset. In contrast, our structure-aware approach avoids this materialization and has the following steps: (1) it defines a set of aggregates needed to compute the gradient of the objective function for the desired model; (2) it optimizes these aggregates over the feature extraction query and under dependencies holding in the database and join dependencies defined by the feature extraction query; (3) it computes these aggregates in bulk using factorization techniques and exploiting subexpressions common among them; and (5) it uses a gradient descent solver to compute the model parameters based on the computed aggregates.

This article brings together and extends two lines of our prior work: The theoretical development of model reparameterization under functional dependencies and of factorized learning [5] and a preliminary report on the design and implementation of AC/DC [4]. The extensions concern the treatment of PCA, simplified proof for model reparameterization under functional dependencies, different experiments, and a classification of the existing landscape of structure-aware versus structure-agnostic approaches to analytics.

Organization

The structure of the paper follows our contributions. Section 2 introduces preliminary notions needed throughout the article. Section 3 describes our unified framework for structure-aware analytics. Section 4 introduces our sparse tensor representation and computation approach for square loss problems (learning polynomial regression models and factorization machines) and principal component analysis together with its complexity analysis. Section 5 shows how to exploit functional dependencies to reduce the dimensionality of learned models. Section 6 discusses the design and implementation of the AC/DC system for learning models over relational databases. Section 7 presents our experimental findings. Section 8 overviews several strands of related work. Finally, Section 9 lists promising directions for future work. Further preliminaries and proofs of some theorems are deferred to the electronic appendix.

2 Preliminaries

We use the following notational conventions: bold face letters, e.g., $\mathbf{x}$ , $\vec{\theta}$ , $\mathbf{x}_{i}$ , $\vec{\theta}_{j}$ , denote vectors or matrices, and normal face letters, e.g., $x_{i}$ , $\theta_{j}$ , $\theta^{(j)}_{i}$ , denote scalars. For any positive integer $n$ , $[n]$ denotes the set $\{1,\dots,n\}$ . For any set $S$ and positive integer $k$ , $\binom{S}{k}$ denotes the collection of all $k$ -subsets of $S$ . Let $S$ be a finite set and Dom be any domain, then $\mathbf{a}_{S}=(a_{j})_{j\in S}\in\textsf{Dom}^{|S|}$ is a tuple indexed by $S$ , whose components are in Dom. If $S$ and $T$ are disjoint, and given tuples $\mathbf{a}_{S}$ and $\mathbf{a}_{T}$ , the tuple $(\mathbf{a}_{S},\mathbf{a}_{T})$ is interpreted naturally as the tuple $\mathbf{a}_{S\cup T}$ .

The tuple $\mathbf{0}_{S}$ is the all-[math] tuple indexed by $S$ . If $S\subseteq G$ , then the tuple $\mathbf{1}_{S|G}$ is the characteristic vector of the subset $S$ , i.e., $\mathbf{1}_{S|G}(v)=1$ if $v\in S$ , and [math] if $v\in G-S$ .

2.1 Feature Extraction Query

We consider the setting where the training dataset $D$ used as input to machine learning is the result of a query $Q$ called feature extraction query, over a relational database $I$ . This query is typically the natural join of the relations in the database. It is also common to join in further relations that are derived from the input relations by aggregating some of their columns. These further relations provide derived features, which add to the raw features readily provided by the input relations.

We use standard notation for query hypergraphs. Let $\mathcal{H}=(\mathcal{V},\mathcal{E})$ denote the hypergraph of the join query $Q$ , where $\mathcal{V}$ is the set of variables occurring in $Q$ and $\mathcal{E}$ is the set of hyperedges with one hyperedge per set of variables in a relation symbol $R$ in the body of $Q$ . We denote by $V\subseteq\mathcal{V}$ the subset of variables selected as features, and let $n=|V|$ . The features in $V$ corresponding to qualitative attributes are called categorical, while those corresponding to quantitative attributes are continuous. Let $N$ be the size of the largest input relation $R$ in $Q$ . Each tuple $(\mathbf{x},y)\in D$ contains a scalar response (regressand) $y$ and a tuple $\mathbf{x}$ encoding features (regressors).

Example 1.

Consider the following natural join query $Q$ that is a simplified version of a feature extraction query:

[TABLE]

Relation Sales records the number of units of a given sku (stock keeping unit) sold at a store on a particular day. The retailer is a global business, so it has stores in different cities and countries. One objective is to predict the number of blue units to be sold next year in the Fall quarter in Berlin. The response is the continuous variable $\mathit{unitsSold}$ , $\mathcal{V}$ is the set of all variables, and $V=\mathcal{V}-\{\mathit{unitsSold},\textsf{day}\}$ , all of which are categorical except $price$ and $size$ .∎

2.2 Matrix calculus

We introduce basic concepts of matrix calculus and the following operations: the Kronecker/tensor product $\otimes$ ; the Hadamard product $\circ$ ; the Khatri-Rao product $\star$ ; and the Frobenius inner product of two matrices $\left\langle\cdot,\cdot\right\rangle$ , which reduces to the vector inner product when the matrices have one column each. We defer further preliminaries on matrix calculus to Appendix A and connection of tensor computation and the FAQ framework [8] to Appendix B.

Basics

We list here common identities we often use in the paper; for more details see the Matrix Cookbook [60]. We use denominator layout for differentiation, i.e., the gradient is a column vector. Let $\mathbf{A}$ be a matrix, and $\mathbf{u},\mathbf{v},\mathbf{x},\mathbf{b}$ be vectors, where $\mathbf{A}$ and $\mathbf{b}$ are independent of $\mathbf{x}$ , and $\mathbf{u}$ and $\mathbf{v}$ are functions of $\mathbf{x}$ then

[TABLE]

The Product Cookbook: Tensor product, Kronecker product, and Khatri-Rao product

Next, we discuss some identities regarding tensors. We use $\otimes$ to denote the tensor product. When taking tensor product of two matrices, this is called the Kronecker product, which is not the same as the outer product for matrices, even though the two are isomorphic maps. If $\mathbf{A}=(a_{ij})$ is an $m\times n$ matrix and $\mathbf{B}=(b_{k\ell})$ is a $p\times q$ matrix, then the tensor product $\mathbf{A}\otimes\mathbf{B}$ is an $mp\times nq$ matrix whose $((i,k),(j,\ell))$ entry is $a_{ij}b_{k\ell}$ . In particular, if $\mathbf{x}=(x_{i})_{i\in[m]}$ is an $m$ -dimensional vector and $\mathbf{y}=(y_{j})_{j\in[p]}$ is an $p$ -dimensional vector, then $\mathbf{x}\otimes\mathbf{y}$ is an $mp$ -dimensional vector whose $(i,j)$ entry is $x_{i}y_{j}$ ; this is not an $m\times p$ matrix as in the case of the outer product. This layout is the correct layout from the definition of the tensor (Kronecker) product. If $\mathbf{A}$ is matrix, then $\mathbf{A}^{\otimes k}$ denote the tensor power $\underbrace{\mathbf{A}\otimes\cdots\otimes\mathbf{A}}_{k\text{ times}}$ .

Definition 1 (Tensor product).

Let $\mathbf{A}$ and $\mathbf{B}$ be tensors of order $r$ and $s$ respectively, i.e., functions $\psi_{A}(X_{1},\dots,X_{r})$ and $\psi_{B}(Y_{1},\dots,Y_{s})$ . The tensor product $\mathbf{A}\otimes\mathbf{B}$ is the multilinear function

[TABLE]

A matrix is a tensor of order $2$ .

Definition 2 (Khatri-Rao product).

Let $\mathbf{A}$ and $\mathbf{B}$ be two matrices each with $n$ columns. We use $\mathbf{A}\star\mathbf{B}$ to denote the matrix with $n$ columns, where the $j$ th column of $\mathbf{A}\star\mathbf{B}$ is the tensor product of the $j$ th column of $\mathbf{A}$ with the $j$ th columns of $\mathbf{B}$ . The operator $\star$ is a (special case of) the Khatri-Rao product [40], where we partition the input matrices into blocks of one column each. More elaborately, if $\mathbf{A}$ has columns $\mathbf{a}_{1},\dots,\mathbf{a}_{n}$ , and $\mathbf{B}$ has columns $\mathbf{b}_{1},\dots,\mathbf{b}_{n}$ ,then one can visualize the $\star$ operator as follows:

[TABLE]

(Note $\mathbf{A}$ and $\mathbf{B}$ do not need to have the same number of rows.)

Definition 3 (Hadamard product).

Let $\mathbf{A}=(a_{ij})$ and $\mathbf{B}=(b_{ij})$ be two $m\times n$ matrices, then the Hadamard product $\mathbf{A}\circ\mathbf{B}$ is an $m\times n$ matrix, where each $i,j$ element is given by $(\mathbf{A}\circ\mathbf{B})_{ij}=a_{ij}b_{ij}$ .

3 Problem formulation

This section introduces a general formulation for a range of machine learning tasks and then lays out a versatile mathematical representation suitable for the in-database treatment of these tasks.

3.1 Continuous features

We start with a standard formulation in machine learning, where all model features are numerical.

The training dataset $D$ , which is defined by a feature extraction query over a relational database, consists of tuples $(\mathbf{x},y)$ of a feature vector $\mathbf{x}$ and a response $y$ .

In case of continuous features, $\mathbf{x}\in\mathbb{R}^{n}$ is the vector of $n$ raw input features, or equivalently the variables in the feature extraction query. We denote by $\vec{\theta}=(\theta_{1},\dots,\theta_{p})\in\mathbb{R}^{p}$ the vector of $p$ so-called parameters. Let $m\geq n$ be an integer. We define feature and parameter maps as follows.

The feature map $h:\mathbb{R}^{n}\to\mathbb{R}^{m}$ transforms the raw input vector $\mathbf{x}$ into an $m$ -vector of “monomial features” $h(\mathbf{x})=(h_{j}(\mathbf{x}))_{j\in[m]}$ . Each component $h_{j}$ is a multivariate monomial designed to capture the interactions among dimensions of input $\mathbf{x}$ . In particular, we write $h_{j}(\mathbf{x})=\prod_{i\in[n]}x_{i}^{a_{j}(i)}$ , where degree $a_{j}(i)$ represents the level of participation of input dimension $i$ in the $j$ -th monomial feature.

The parameters $\vec{\theta}$ produce the coefficients associated with features $h$ via parameter map $g:\mathbb{R}^{p}\to\mathbb{R}^{m}$ , $g(\vec{\theta})=(g_{j}(\vec{\theta}))_{j\in[m]}$ . Each component $g_{j}$ is a multivariate polynomial of $\vec{\theta}$ .

A large number of machine learning tasks learn a functional quantity of the form $\left\langle g(\vec{\theta}),h(\mathbf{x})\right\rangle$ , where the parameters $\vec{\theta}$ are obtained by solving $\min_{\vec{\theta}}J(\vec{\theta})$ with

[TABLE]

$\mathcal{L}$ is a loss function, e.g., square loss, and $\Omega$ is a regularizer, e.g., $\ell_{1}$ - or $\ell_{2}$ -norm of $\vec{\theta}$ . For square loss and $\ell_{2}$ -regularization, $J(\vec{\theta})$ becomes:

[TABLE]

Example 2.

The ridge linear regression (LR) model with response $y$ and regressors $x_{1},\dots,x_{n}$ has $p=n+1$ , parameters $\vec{\theta}=(\theta_{0},\dots,\theta_{n})$ . For convenience, we set $x_{0}=1$ corresponding to the bias parameter $\theta_{0}$ . Then, $m=n+1$ , $\mathbf{g}(\vec{\theta})=\vec{\theta}$ , and $h(\mathbf{x})=\mathbf{x}$ . The inner product becomes $\left\langle g(\vec{\theta}),h(\mathbf{x})\right\rangle=\left\langle\vec{\theta},\mathbf{x}\right\rangle=\sum_{i=0}^{n}\theta_{i}x_{i}$ and Equation (7) becomes $J(\vec{\theta})=\frac{1}{2|D|}\sum_{(\mathbf{x},y)\in D}\left(\sum_{i=0}^{n}\theta_{i}x_{i}-y\right)^{2}+\frac{\lambda}{2}\left\|\vec{\theta}\right\|_{2}^{2}.$ ∎

Example 3.

The degree- $d$ polynomial regression ( $\text{\sf PR}^{d}$ ) model with response $y$ and regressors $x_{0}=1,x_{1},\dots,x_{n}$ has $p=m=\binom{n+d}{d}=\sum_{i=0}^{d}\binom{n+i-1}{i}$ parameters $\vec{\theta}=(\theta_{\mathbf{a}})$ , where $\mathbf{a}=(a_{1},\dots,a_{n})$ is a tuple of non-negative integers such that $\left\|\mathbf{a}\right\|_{1}\leq d$ . In this case, $g(\vec{\theta})=\vec{\theta}$ , while the components of $h$ are given by $h_{\mathbf{a}}(\mathbf{x})=\prod_{i=1}^{n}x_{i}^{a_{i}}$ .∎

Example 4.

In contrast to polynomial regression models, factorization machines [65] factorize the space of model parameters to better capture data correlations. The degree- $2$ rank- $r$ factorization machines ( $\text{\sf FaMa}^{2}_{r}$ ) model with regressors $x_{0}=1,x_{1},\dots,x_{n}$ and regressand $y$ has parameters $\vec{\theta}$ consisting of $\theta_{i}$ for $i\in\{0,\dots,n\}$ and $\theta^{(l)}_{i}$ for $i\in[n]$ and $l\in[r]$ . Training $\text{\sf FaMa}^{2}_{r}$ corresponds to minimizing the following $J(\vec{\theta})$ :

[TABLE]

This loss function follows Equation (7) with $p=1+n+rn$ , $m=1+n+\binom{n}{2}$ , and the maps

[TABLE]

∎

Example 5.

Classification methods such as support vector machines (SVM), logistic regression and Adaboost also fall under the same optimization framework, but with different choices of loss $\mathcal{L}$ and regularizer $\Omega$ . Typically, $\Omega(\vec{\theta})=\frac{\lambda}{2}\|\vec{\theta}\|_{2}^{2}$ . Restricting to binary class labels $y\in\{\pm 1\}$ , the loss function $\mathcal{L}(\gamma,y)$ , where $\gamma=\langle g(\vec{\theta}),h(\mathbf{x})\rangle$ , takes the form $\mathcal{L}(\gamma,y)=\max\{1-y\gamma,0\}$ for SVM, $\mathcal{L}(\gamma,y)=\log(1+e^{-y\gamma})$ for logistic regression and $\mathcal{L}(\gamma,y)=e^{-y\gamma}$ for Adaboost.∎

Example 6.

Various unsupervised learning techniques can be expressed as iterative optimization procedures according to which each iteration is reduced to an optimization problem of the generic form given above. For example, the Principal Component Analysis (PCA) requires solving the following optimization problem to obtain a principal component direction

[TABLE]

where $\vec{\Sigma}=\frac{1}{|D|}\sum_{\mathbf{x}\in D}\mathbf{x}\mathbf{x}^{\top}$ is the (empirical) correlation matrix of the given data. Although there is no response/class label $y$ , within each iteration of the above iteration, for a fixed $\lambda$ , there is a loss function $\mathcal{L}$ acting on feature vector $h(\mathbf{x})$ and parameter vector $g(\vec{\theta})$ , along with a regularizer $\Omega$ . Specifically, we have $h(\mathbf{x})=\vec{\Sigma}\in\mathbb{R}^{p\times p}$ , $g(\vec{\theta})=\vec{\theta}\otimes\vec{\theta}\in\mathbb{R}^{p\times p}$ , $\mathcal{L}=\langle g(\vec{\theta}),h(\mathbf{x})\rangle_{F}$ , where the Frobenius inner product is now employed. In addition, $\Omega(\vec{\theta})=\lambda(\|\vec{\theta}\|_{2}^{2}-1)$ .∎

3.2 Categorical features

The active domain of a categorical feature/variable is a set of possible values or categories, e.g., vietnam, england, and usa are possible categories of the categorical feature country. Categorical features constitute the vast majority of features we observed in machine learning applications.

It is common practice to one-hot encode categorical variables [36]. Whereas a continuous variable such as salary is mapped to a scalar value $x_{\textsf{salary}}$ , a categorical variable such as country is mapped to an indicator vector $\mathbf{x}_{\text{\sf country}}$ – a vector of binary values indicating the category that the variable takes on. For example, if the active domain of country consists of vietnam, england, and usa, then $\mathbf{x}_{\text{\sf country}}=[x_{\text{\sf vietnam}},x_{\text{\sf england}},x_{\text{\sf usa}}]\in\{0,1\}^{3}$ . If a tuple in the training dataset has country = “england”, then $\mathbf{x}_{\text{\sf country}}=[0,1,0]$ for that tuple.

In general, the feature vector $\mathbf{x}$ has the form $\mathbf{x}=(\mathbf{x}_{c})_{c\in V}$ , where each component $\mathbf{x}_{c}$ is an indicator vector if $c$ is a categorical variable and a scalar otherwise. Similarly, each component of the parameter vector $\vec{\theta}$ becomes a matrix, or a vector if the matrix has one column.

3.3 Tensor product representation

We accommodate both continuous and categorical features in our problem formulation (7) by replacing arithmetic product by tensor product in the component functions of the parameter map $g$ and the feature map $h$ . Specifically, monomials $h_{j}$ now take the form

[TABLE]

with degree vector $\mathbf{a}_{j}=(a_{j}(f))_{f\in V}\in\mathbb{N}^{n}$ . For each $j\in[m]$ , the set $V_{j}=\{f\in V\ |\ a_{j}(f)>0\}$ consists of features that participate in the interaction captured by the (hyper-) monomial $h_{j}$ . Let $C\subseteq V$ denote the set of categorical variables and $C_{j}=C\cap V_{j}$ the subset of categorical variables in $V_{j}$ . For $f\in C_{j}$ , $h_{j}$ represents $\prod_{f\in C_{j}}|\pi_{f}(D)|$ many monomials, one for each combination of the categories, where $\pi_{f}(D)$ denotes the projection of $D$ onto variable $f$ . Due to one-hot encoding, each element in the vector $\mathbf{x}_{f}$ for a categorical variable $f$ is either [math] or $1$ , and $\mathbf{x}_{f}^{a_{j}(f)}=\mathbf{x}_{f}$ for $a_{j}(f)>0$ . Hence, $h_{j}$ can be simplified as follows:

[TABLE]

Note that we use $x_{f}$ instead of boldface $\mathbf{x}_{f}$ since each variable $f\in V_{j}-C_{j}$ is continuous.

Example 7.

For illustration, consider a query that extracts tuples over schema $(\textsf{country},a,b,$ $c,\textsf{color})$ from the database, where country and color are categorical variables, while $a,b,c$ are continuous variables. Moreover, there are two countries vietnam and england, and three colors red, green, and blue in the training dataset $D$ . Consider three of the possible feature functions:

[TABLE]

Under the one-hot encoding, the schema of the tuples becomes:

[TABLE]

Equation (9) says that the functions $h_{1}$ and $h_{2}$ are actually encoding $8$ functions:

[TABLE]

∎

We elaborate the tensor product representation for the considered learning models.

Example 8.

In linear regression, parameter $\vec{\theta}$ is a vector of vectors: $\vec{\theta}=[\vec{\theta}_{0},\dots,\vec{\theta}_{n}]$ . Since our inner product is Frobenius, when computing $\left\langle\vec{\theta},\mathbf{x}\right\rangle$ we should be multiplying, for example, $\theta_{\text{\sf usa}}$ with $x_{\text{\sf usa}}$ correspondingly. ∎

Example 9.

In polynomial regression, the parameter $\vec{\theta}$ is a vector of tensors (i.e., high-dimensional matrices). Consider for instance the second order term $\theta_{ij}x_{i}x_{j}$ . When both $i$ and $j$ are continuous, $\theta_{ij}$ is just a scalar. Now, suppose $i$ is country and $j$ is color. Then, the model has terms $\theta_{\text{\sf vietnam},\textsf{red}}x_{\text{\sf vietnam}}x_{\textsf{red}}$ , $\theta_{\text{\sf usa},\textsf{green}}x_{\text{\sf usa}}x_{\textsf{green}}$ , and so on. All these terms are captured by the Frobenius inner product $\left\langle\vec{\theta}_{ij},\mathbf{x}_{i}\otimes\mathbf{x}_{j}\right\rangle$ . The component $\vec{\theta}_{ij}$ is a matrix whose number of entries is the number of pairs (country, color) that appear together in some tuple in the training dataset. This number can be much less than the product of the numbers of countries and of colors in the input database.∎

Example 10.

Consider the $\text{\sf FaMa}^{2}_{r}$ model from Example (4), but now with categorical variables. From the previous examples, we already know how to interpret the linear part $\sum_{i=0}^{n}\theta_{i}x_{i}$ of the model when features are categorical. Consider a term in the quadratic part such as $\sum_{\ell\in[r]}\theta_{i}^{(\ell)}\theta_{j}^{(\ell)}x_{i}x_{j}$ . When $i$ and $j$ are categorical, the term becomes $\left\langle\sum_{\ell\in[r]}\vec{\theta}_{i}^{(\ell)}\otimes\vec{\theta}_{j}^{(\ell)},\mathbf{x}_{i}\otimes\mathbf{x}_{j}\right\rangle.$ ∎

4 Database-centric problem reformulation

In this section, we show how we reformulate the square loss optimization problems (learning polynomial regression and factorization machine models) and PCA to encode their data-intensive components as FAQs. The ensemble of these FAQs form a sparse tensor representation and computation solution with lower space and time complexity than solutions based on one-hot encoding.

4.1 Solution for square loss problems

We introduce our approach to learning statistical models for the setting of square loss function $J(\vec{\theta})$ and $\ell_{2}$ -norm as in (7). We use a gradient-based optimization algorithm that employs the first-order gradient information to optimize the loss function $J(\vec{\theta})$ . It repeatedly updates the parameters $\vec{\theta}$ by some step size $\alpha$ in the direction of the gradient $\mbox{\boldmath$ \nabla $}J(\vec{\theta})$ until convergence. To guarantee convergence, it uses backtracking line search to ensure that $\alpha$ is sufficiently small to decrease the loss for each step. Each update step requires two computations: (1) Point evaluation: Given $\vec{\theta}$ , compute the scalar $J(\vec{\theta})$ ; and (2) Gradient computation: Given $\vec{\theta}$ , compute the vector $\mbox{\boldmath$ \nabla $}J(\vec{\theta})$ . In particular, we use the batch gradient descent (BGD) algorithm with the Armijo line search condition and the Barzilai-Borwein step size adjustment [14, 28], as depicted in Algorithm 1. Quasi-Newton optimization algorithms (e.g., L-BFGS) and other common line search conditions are also applicable in our framework. We refer the reader to the excellent review article [31] for additional details on fast implementations of gradient-descent optimization methods.

Continuous features

We first consider the case without categorical features. We rewrite the square-loss function (7) to factor out the data-dependent part of the point evaluation and gradient computation. Recall that, for $j\in[m]$ , $h_{j}$ denotes the $j$ th component function of the vector-valued function $h$ , and $h_{j}$ is a multivariate monomial in $\mathbf{x}$ .

Theorem 4.1.

Let $J(\vec{\theta})$ be the function in (7). Define the matrix $\vec{\Sigma}=(\sigma_{ij})_{i,j\in[m]}$ , the vector $\mathbf{c}=(c_{i})_{i\in[m]}$ , and the scalar $s_{Y}$ by

[TABLE]

Then,

[TABLE]

Note that $\frac{\partial g(\vec{\theta})^{\top}}{\partial\vec{\theta}}$ is a $p\times m$ matrix, and $\vec{\Sigma}$ is an $m\times m$ matrix. Statistically, $\vec{\Sigma}$ is related to the covariance matrix, $\mathbf{c}$ to the correlation between the response and the regressors, and $s_{Y}$ to the empirical second moment of the response variable. Theorem 4.1 allows us to compute the two key steps of BGD without scanning through the data again, because the quantities $(\vec{\Sigma},\mathbf{c},s_{Y})$ can be computed efficiently in a preprocessing step inside the database as aggregates over the feature extraction query $Q$ .

Example 11.

Consider the query $Q$ in Example 1, where the set of features is {sku, store, day, color, quarter, city, country, $price$ , $size$ } and unitsSold is the response variable. In this query $n=9$ , and thus for a $\text{\sf PR}_{2}$ model we have $m=1+9+\binom{9}{2}=46$ parameters. Consider two indices $i$ and $j$ to the component functions of $g$ and $h$ , where $i=(price)$ and $j=(size)$ . Then we can compute the entry $\sigma_{ij}\in\vec{\Sigma}$ with the following SQL query:

[TABLE]

∎

When $g$ is the identity function, i.e., the model is linear, as is the case in PR (and thus LR) model, (16) and (17) become particularly simple:

Corollary 4.2.

In a linear model (i.e., $g(\vec{\theta})=\vec{\theta}$ ),

[TABLE]

Let $\mathbf{d}=\mbox{\boldmath$ \nabla $}J(\vec{\theta})$ . Then,

[TABLE]

The Armijo condition $J(\vec{\theta}-\alpha\mathbf{d})\geq J(\vec{\theta})-\frac{\alpha}{2}\left\|\mathbf{d}\right\|_{2}^{2}$ becomes:

[TABLE]

The significance of (21) is as follows. In a typical iteration of BGD, we have to backtrack a few times (say $t$ times) for each value of $\alpha$ . If we were to recompute $J(\vec{\theta}-\alpha\mathbf{d})$ using (18) each time, then the runtime of Armijo backtracking search is $O(tm^{2})$ , even after we have already computed $\mathbf{d}$ and $J(\vec{\theta})$ . Now, using (21), we can compute in advance the following quantities (in this order): $\mathbf{d}$ , $\left\|\vec{\theta}\right\|_{2}^{2}$ , $\vec{\Sigma}\mathbf{d}$ , $\left\langle\mathbf{c},\mathbf{d}\right\rangle$ , $\left\langle\vec{\theta},\mathbf{d}\right\rangle$ , $\mathbf{d}^{\top}\vec{\Sigma}\mathbf{d}$ , $\vec{\theta}^{\top}\vec{\Sigma}\mathbf{d}$ . Then, each check for inequality (21) can be done in $O(1)$ -time, for a total of $O(m^{2}+t)$ -times. Once we have determined the step size $\alpha$ , (20) allows us to compute the next gradient (i.e., the next $\mathbf{d}$ ) in $O(m)$ , because we have already computed $\vec{\Sigma}\mathbf{d}$ for line search.

To implement BGD, we need to compute four quantities efficiently: the $\vec{\Sigma}$ matrix in (13), the vector $\mathbf{c}$ in (14), point evaluation in (16), and the gradient in (17). The covariance matrix and the correlation vector only have to be computed once in a pre-processing step. The gradient is computed at every iteration, which includes several point evaluations as we perform line search.222In our implementation, each iteration typically involves $1$ - $4$ backtracking steps. We do not need to compute the second moment $s_{Y}$ because optimizing $J(\vec{\theta})$ is the same as optimizing $J(\vec{\theta})-s_{Y}$ . Before describing how those four quantities can be computed efficiently, we discuss how we deal with categorical features.

Categorical features via sparse tensors

The more interesting, more common, and also considerably challenging situation is in the presence of categorical features. We next explain how we accommodate categorical features in the computation of $\vec{\Sigma}$ and $\mathbf{c}$ .

Example 12.

In Example 7, the matrix $\vec{\Sigma}$ is of size $8\times 8$ instead of $3\times 3$ after one-hot encoding. However, many of those entries are [math], for instance ( $\forall(\mathbf{x},y)\in D$ ):

[TABLE]

The reason is that the indicator variables $x_{\textsf{blue}}$ and $x_{\textsf{england}}$ act like selection clauses $x_{\textsf{color}}=\textsf{blue}$ and $x_{\textsf{country}}=\textsf{england}$ . More concretely, we can rewrite the entry $\sigma_{ij}$ as an aggregate over a more selective query. For instance, the entry that corresponds to the product of functions $h_{1,\textsf{vietnam}}(\mathbf{x})$ and $h_{2,\textsf{vietnam},\textsf{red}}(\mathbf{x})$ from Example 7 can be rewritten as follows:

[TABLE]

where $\phi=((\mathbf{x},y)\in D\wedge x_{\textsf{color}}=\textsf{red}\wedge x_{\textsf{country}}=\textsf{vietnam})$ .∎

Extrapolating straightforwardly, if we were to write $\vec{\Sigma}$ down in the one-hot encoded feature space, then the entries $\sigma_{ij}$ under one-hot encoding got unrolled into many entries. Let $C_{i}$ and $C_{j}$ be the set of categorical variables for $h_{i}$ and $h_{j}$ as defined in Section 3.3. Then, $\sigma_{ij}$ is in fact a tensor $\vec{\sigma}_{ij}$ of dimension $\prod_{f\in C_{i}}|\pi_{f}(D)|\times\prod_{f\in C_{j}}|\pi_{f}(D)|$ , because

[TABLE]

Similarly, each component $c_{j}$ of $\mathbf{c}$ defined in (14) is a tensor $\mathbf{c}_{j}$ of dimension $\prod_{f\in C_{j}}|\pi_{f}(D)|$ , because $h_{j}(\mathbf{x})$ is a tensor in the categorical case. The following follows immediately.

Theorem 4.3.

Theorem 4.1 remains valid even when some features are categorical.

Note that the outer product in (22) specifies the matrix layout of $\vec{\sigma}_{ij}$ , and so $\vec{\Sigma}$ is a block matrix, each of whose blocks is $\vec{\sigma}_{ij}$ . Furthermore, if we were to layout the tensor $\vec{\sigma}_{ij}$ as a vector, we can also write it as

[TABLE]

The previous example demonstrates that the dimensionalities of $\vec{\sigma}_{ij}$ and $\mathbf{c}_{j}$ can be very large. Fortunately, the tensors are very sparse, and a sparse representation of them can be computed with functional aggregate queries (in the FAQ-framework [8]) as shown in Proposition 4.4 below. We next illustrate the sparsity.

Example 13.

We extend the Example 11 for entries in Sigma with categorical variables. Consider two indices $i$ and $j$ to the component functions of $g$ and $h$ , where $i=(\textsf{store, city})$ and $j=(\textsf{city})$ . Suppose the query result states that the retailer has $N_{s}$ stores in $N_{c}$ countries. Then, the full dimensionality of the tensor $\vec{\sigma}_{ij}$ is $N_{s}\times N_{c}^{2}$ , because by definition it was defined to be

[TABLE]

Recall that $\mathbf{x}_{\textsf{store}}$ and $\mathbf{x}_{\textsf{city}}$ are both indicator vectors. The above tensor has the following straightforward interpretation: for every triple $(store,city_{1},city_{2})$ , where $s$ is a store and $c_{1}$ and $c_{2}$ are cities, this triple entry of the tensor counts the number of data points $(\mathbf{x},y)\in D$ for this particular combination of store and cities (divided by $1/|D|$ ). Most of these $(s,c_{1},c_{2})$ -entries are [math]. For example, if $c_{1}\neq c_{2}$ then the count is zero. Thus, we can concentrate on computing entries of the form $(s,c,c)$ :

[TABLE]

Better yet, since store functionally determines city, the number of entries in the query output is bounded by $N_{s}$ . Using relations to represent sparse tensor results in massive space saving.

We can also succinctly represent entries in $\vec{\Sigma}$ that are composed of continuous and categorical variables. Consider the entry that corresponds to dimensions $i=(\textsf{store, city})$ and $j=({\sf city},price)$ . We can compute this entry with the following SQL query:

[TABLE]

∎

4.2 Solution for Principal Component Analysis

We next consider principal component analysis (PCA) over the training dataset defined by a feature extraction query. We focus on the problem of computing the top- $K$ principal components, which correspond to the eigenvectors of the covariance matrix. Once computed, the principal components are then used to transform the data to a lower dimensional space. We show that the solution to this problem requires similar computations as our solution for square loss problems in Section 4.1.

Continuous features

We first consider the case with continuous features only. Let $\vec{\mu}=\frac{1}{|D|}\sum_{\mathbf{x}\in D}\mathbf{x}$ be the vector of means for each variable in the feature extraction query, and $\vec{\Sigma}_{1}=\frac{1}{|D|}\sum_{\mathbf{x}\in D}(\mathbf{x}-\vec{\mu})(\mathbf{x}-\vec{\mu})^{\top}$ the centered covariance matrix. The top- $K$ eigenvectors $\vec{\theta}=(\vec{\theta}_{1},\ldots,\vec{\theta}_{K})$ and the corresponding eigenvalues $\vec{\lambda}=(\lambda_{1},\ldots,\lambda_{K})$ can be computed one at a time using the min-max theorem based on the Rayleigh quotient [67]:

[TABLE]

We compute the optimal solution for the top eigenvector $\vec{\theta}_{1}$ using a gradient-based optimization algorithm, which optimizes the following loss function by alternating between performing gradient ascent with respect to $\vec{\theta}_{1}$ and gradient descent with respect to $\lambda_{1}$ until convergence:

[TABLE]

The gradient optimization steps can then be done with Algorithm 1, where the gradient of $J(\vec{\theta}_{1},\lambda_{1})$ for the two subproblems is given by:

[TABLE]

The subsequent eigenvectors are computed with the same optimization procedure but over an updated covariance matrix that subtracts all previously computed principal components. The iteration step assumes we already computed the covariance matrix $\vec{\Sigma}_{l}$ , the eigenvector $\vec{\theta}_{l}$ , and the eigenvalue $\lambda_{l}$ for $l\in[K-1]$ . The step then computes the eigenvector $\vec{\theta}_{l+1}$ and the eigenvalue $\lambda_{l+1}$ over the covariance matrix

[TABLE]

Once we computed the top- $K$ eigenvectors $\vec{\theta}$ , the projection of a training sample $\mathbf{x}\in D$ onto the lower $K$ -dimensional space is given by the inner product $\mathbf{x}^{\top}\vec{\theta}$ .

As for the square-loss problems, we can compute $\vec{\Sigma}_{1}$ once, and then compute the eigenvectors without scanning the data again. If the data is centered in a preprocessing step, then the computation of $\vec{\Sigma}_{1}$ for PCA is identical to (13) for the case of linear regression. If the data is not centered, we can compute the covariance matrix with the following reformulation:

[TABLE]

Thus, we can compute the covariance matrix by first computing the matrix from (13) where $h(\mathbf{x})=\mathbf{x}$ and then subtracting $\vec{\mu}\vec{\mu}^{\top}$ to center the data.

The gradient with respect to $\vec{\theta}$ for PCA requires the same computation over $\vec{\Sigma}_{1}$ as the gradient for linear regression models (c.f. (19)).

Categorical features via sparse tensors

PCA is based on the analysis of variance between variables, and therefore it cannot be computed directly over categorical data. It can however be meaningful to compute PCA over one-hot encoded categorical data, which would provide insights into the variance of the frequency of the co-occurrence of categories for different categorical variables. We can compute PCA over one-hot encoded categorical variables efficiently by computing it over the sparse representation of the covariance matrix, which is a variant of the sparse tensor representation of the $\vec{\Sigma}$ matrix that we introduced for the case of square-loss problems.

One difference between the square loss problems and PCA is that PCA requires its features to be linearly independent. This property is not satisfied by one-hot encoding, because it is possible to derive the indicator value for one category based on a linear combination of the indicator values for all other categories. For this reason, it is common practice to do one-hot encoding of the categorical variables for all but one category. In our problem formulation, this means that for a categorical variable $c$ , we encode the corresponding component $\mathbf{x}_{c}$ as an indicator vector whose size is the number of its categories minus one, so that we one-hot encode over all but the last category of $c$ . This encoding is often referred to as dummy encoding in many data science tools.

Another difference is the requirement to center the data. For categorical data, it is not desirable to center the data in a preprocessing step, as this would require a one-hot encoding of the input relations. To avoid the materialization of the one-hot encoding, we compute the non-centered matrix first, and then subtract $\vec{\mu}\vec{\mu}^{\top}$ , as shown in (30). The sparse representation of the covariance matrix is then a block matrix, where each entry $\vec{\sigma}_{ij}\in\vec{\Sigma}_{1}$ is defined as:

[TABLE]

The vector of means $\vec{\mu}$ has the same dimension as $\mathbf{x}$ , where for each categorical variable $c\in V$ the component $\vec{\mu}_{c}\in\vec{\mu}$ is the vector of frequencies for all but one category in the domain of $c$ :

[TABLE]

The vector $\vec{\mu}_{c}$ can be computed efficiently as a SQL count query with group-by variable $c$ . We drop the group with the lowest count and divide the count for each other group by $|D|$ .

The resulting matrix $\vec{\Sigma}_{1}$ has the same structure as the sparse tensor that is computed for linear regression problems. In fact, the quantity $\vec{\sigma}_{ij}$ in (31) is simply the centered variant of the expression in (22) for the case where $h(\mathbf{x})=\mathbf{x}$ . The centering of the $\vec{\Sigma}_{1}$ as well as updating the matrix for subsequent eigenvectors can be expressed as group-by aggregate queries and computed without materializing the quantities $\vec{\mu}\vec{\mu}^{\top}$ and $\vec{\theta}\vec{\theta}^{\top}$ .

Example 14.

Consider the entry $\vec{\sigma}_{ij}\in\vec{\Sigma}_{1}$ where $i=(\textsf{store})$ and $j=(\textsf{city})$ . We can compute the centered entry in the covariance matrix based on the non-centered entry $\vec{\sigma}_{ij}$ and the frequency vectors for store and city. The non-centered entry is computed with the SQL query in Example 13,

Let $\vec{\mu}_{s}({\sf store,val})$ and $\vec{\mu}_{c}({\sf city,val})$ be the relational encoding of the frequency vectors for store and respectively city. The relations store tuples that give for each city and respectively store the corresponding frequency that is denoted by val. We can then compute the $(i,j)$ entry in the centered covariance matrix without materializing the product of $\vec{\mu}_{s}$ and $\vec{\mu}_{c}^{\top}$ with the following SQL query:

[TABLE]

Let $\vec{\theta}_{s}({\sf store,val})$ and $\vec{\theta}_{c}({\sf city,val})$ be the relational encodings of the components in $\vec{\theta}_{1}$ that correspond to store and respectively city. We can compute the updated entry $\vec{\sigma}_{ij}\in\vec{\Sigma}_{2}$ based on (29) without materializing the product of $\vec{\theta}_{s}$ and $\vec{\theta}_{c}^{\top}$ with the following query:

[TABLE]

∎

The eigenvectors and eigenvalues can be computed on top of the sparse representation of the covariance matrix without touching the input database, and the gradient with respect to the eigenvectors requires similar computation as the gradient for square loss problems. We next show how we can compute the sparse tensor representation.

4.3 Efficient computation of the sparse tensor representation

We consider the problem of computing the sparse tensor representation for a given optimization problem. For square-loss problems the sparse tensor captures the quantities $\vec{\Sigma}$ and $\mathbf{c}$ , and for PCA we compute the non-centered covariance matrix (referred to as $\vec{\Sigma}$ for uniformity), which is then centered in a subsequent step as shown in Example 14.

An immediate approach to computing this representation is to first materialize the result of the feature extraction query $Q$ using an efficient query engine, e.g., a worst-case optimal join algorithm, and then compute the entries in the representation as aggregates over the query result. This approach, however, is suboptimal, since the listing representation of the query result is highly redundant and not necessary for the computation of the aggregates.

We employ two orthogonal observations to avoid this redundancy.

First, we use the FAQ [8] and FDB [55] frameworks for factorized computation of aggregates over joins. In a nutshell, factorized aggregate computation unifies three powerful ideas: worst-case optimal join processing, query plans defined by fractional hypertree decompositions of join queries, and pushing aggregates past joins.

Second, we exploit the observation that in the computation of $\vec{\Sigma}$ many distinct tensors $\vec{\sigma}_{ij}$ have identical sparse representations. For instance, the tensor $\vec{\sigma}_{ij}$ from Example 13 corresponding to $i=(\textsf{store, city})$ and $j=(\textsf{city})$ has the same sparse representation as any of the following tensors: $(i,j)\in\{(({\sf city,city}),{\sf store}),$ $(({\sf store,store}),{\sf city}),$ $(({\sf store,city}),{\sf store}),\ldots\}$ . This is because store and city are categorical features and taking any power of the binary values in their indicator vectors does not change these values. Furthermore, any of the two features can be in $i$ and/or $j$ .

The time complexity of computing the representation can be lower than that of materializing the result of the feature extraction query $Q$ . Let $|\vec{\sigma}_{ij}|$ denote the size (i.e., number of tuples) of the sparse representation of the $\vec{\sigma}_{ij}$ tensor. Let $\text{\sf faqw}(i,j)$ denote the FAQ*-width* of the FAQ-query333We show in the proof of Proposition 4.4 in Appendix D how to express $\vec{\sigma}_{ij}$ and $\mathbf{c}_{j}$ as FAQ-queries. that expresses the aggregate $\vec{\sigma}_{ij}$ over $Q$ ; fhtw be the fractional hypertree width of $Q$ ; and $\rho^{*}$ be the fractional edge cover number 444Due to space limitation, these width notions are defined in Appendix C. of $Q$ . Let $I$ be the input database and $D=Q(I)$ . Let $N$ be the size of the largest input relation in $Q$ , which means that $|D|=O(N^{\rho^{*}})$ . Recall that $\mathcal{V}$ is the set of query variables in $Q$ , $\mathcal{E}$ is the set of relations in $Q$ , and $m$ is the number of features. The time to compute the sparse tensor representation can be bounded as follows.

Proposition 4.4.

The tensors $\vec{\sigma}_{ij}$ and $\mathbf{c}_{j}$ can be sparsely represented by FAQ-queries with group-by variables $C_{i}\cup C_{j}$ and $C_{j}$ , respectively. They can be computed in time

[TABLE]

In case all features in $D$ are continuous, i.e., $C_{j}=\emptyset$ for all $j\in[m]$ , then $\text{\sf faqw}(i,j)=\text{\sf fhtw}$ [8] and the overall runtime becomes $O(|\mathcal{V}|^{2}\cdot|\mathcal{E}|\cdot m^{2}\cdot N^{\text{\sf fhtw}}\cdot\log N)$ . When some features are categorical, we can also bound the width $\text{\sf faqw}(i,j)$ and tensor size.

Proposition 4.5.

Let $c=\max_{i,j}|C_{i}\cup C_{j}|$ be the maximum number of categorical variables for any $\vec{\sigma}_{ij}$ . Then, $\text{\sf faqw}(i,j)\leq\text{\sf fhtw}+c-1$ and $|\vec{\sigma}_{ij}|\leq N^{\min\{\rho^{*},c\}}\}$ , $\forall i,j\in[m]$ .

For any query $Q$ with $\rho^{*}>\text{\sf fhtw}+c-1$ , there are infinitely many database instances for which

[TABLE]

Our precomputation step takes strictly sub-output-size runtime for infinitely many queries and database instances. If we were to compute $\vec{\sigma}_{ij}$ on a training dataset with categorical variables one-hot encoded, then the complexity would raise to $O(|\mathcal{V}|^{2}\cdot|\mathcal{E}|\cdot m^{2}\cdot N^{\text{\sf fhtw}+2d}\log N)$ , where $d$ is the degree of the polynomial regression model or factorization machine.

4.4 Point evaluation and gradient computation

We introduce two ideas for efficient point evaluation and gradient computation.

First, we employ a sparse representation of tensors in the parameter space. We need to evaluate the component functions of $g$ , which are polynomial. In the $\text{\sf FaMa}^{2}_{r}$ example, for instance, we evaluate expressions of the form

[TABLE]

The result is a $2$ -way tensor whose CP-decomposition (a sum of rank- $1$ tensors) is already given by (34)! There is no point in materializing the result of $g_{\textsf{store, city}}(\vec{\theta})$ and we instead keep it as is. Assuming $N_{c}$ distinct cities and $N_{s}$ distinct stores in the training dataset $D$ , if we were to materialize the tensor, then we would end up with an $\Omega(N_{c}N_{s})$ -sized result for absolutely no gain in computational and space complexity, while the space complexity of the CP-decomposition is only $O(N_{c}+N_{s})$ . This is a prime example of factorization of the parameter space.

Second, we explain how to evaluate (16) and (17) with our sparse tensor representation. The same techniques can also be applied to evaluate (26) and (27) for PCA. There are two aspects of our solution worth spelling out: (1) how to multiply two tensors, e.g., $\vec{\sigma}_{ij}$ and $g_{j}(\vec{\theta})$ , and (2) how to exploit that some tensors have the same representation to speed up the point evaluation and gradient computation.

To answer question (1), we need to know the intrinsic dimension of the tensor $\vec{\sigma}_{ij}$ . In order to compute $\vec{\Sigma}g(\vec{\theta})$ in Example 13, we need to multiply $\vec{\sigma}_{ij}$ with $g_{j}(\vec{\theta})$ for $i=(\textsf{store, city})$ and $j=(\textsf{city})$ . In a linear model, $g_{j}(\vec{\theta})=\vec{\theta}_{j}=\vec{\theta}_{\textsf{city}}$ . In this case, when computing $\vec{\sigma}_{ij}\vec{\theta}_{\textsf{city}}$ we marginalize away one city dimension of the tensor, while keeping the other two dimensions store, city. This is captured by the following query:

[TABLE]

where the tensors $\vec{\sigma}_{i,j}$ and $\vec{\theta}_{j}$ map $(\textsf{store, city})$ and respectively $(\textsf{city})$ to aggregate values. In words, $\vec{\sigma}_{ij}g_{j}(\vec{\theta})$ is computed by a group-by aggregate query where the group-by variables are precisely the variables in $C_{i}$ .

For question (2), we use the CP-decomposition of the parameter space as discussed earlier. Suppose now we are looking at the $\vec{\sigma}_{ij}$ tensor where $i=(\textsf{city})$ and $j=(\textsf{store, city})$ . Note that this tensor has the identical representation as the above tensor, but it is a different tensor. In a $\text{\sf FaMa}^{2}_{r}$ model, we would want to multiply this tensor with the component function $g_{j}(\vec{\theta})$ defined in (34) above. We do so by multiplying it with each of the terms $\vec{\theta}^{(\ell)}_{\textsf{store}}\otimes\vec{\theta}^{(\ell)}_{\textsf{city}}$ , one by one for $\ell=1,\dots,r$ , and then add up the result. Multiplying the tensor $\vec{\sigma}_{ij}$ with the first term $\vec{\theta}^{(1)}_{\textsf{store}}\otimes\vec{\theta}^{(1)}_{\textsf{city}}$ corresponds precisely to the following query:

[TABLE]

where the tensors $\vec{\sigma}_{i,j}$ , $\vec{\theta}^{(1)}_{\textsf{city}}$ , and $\vec{\theta}^{(1)}_{\textsf{store}}$ map $(\textsf{store, city})$ , $(\textsf{city})$ , and respectively $(\textsf{store})$ to aggregate values. Finally, to answer question (2), note that for the same column $j$ (i.e., the same component function $g_{j}(\vec{\theta})$ ), there can be multiple tensors $\vec{\sigma}_{ij}$ which have identical sparse representations. (This holds especially in models of degree $>1$ .)

In such cases, we have queries for point evaluation and gradient computation with identical from-where blocks but different select-group-by clauses, because the tensors have different group-by variables. Nevertheless, all such queries can share computation as we can compute the from-where clause once for all of them and then scan this result to compute each specific tensor. This analysis gives rise to the following straightforward (and conservative) estimates.

For each $j\in[m]$ , let $d_{j}$ denote the degree and $t_{j}$ denote the number of terms in the polynomial $g_{j}$ (a component function of $g$ ). Recall that $p$ is the number of parameters.

Proposition 4.6.

Point evaluation (16) and gradient computation (17) can be computed in time $O(\sum_{i,j\in[m]}$ $t_{i}t_{j}d_{i}d_{j}|\vec{\sigma}_{ij}|)$ , and respectively $O(p\sum_{i,j\in[m]}t_{i}t_{j}d_{i}d_{j}|\vec{\sigma}_{ij}|)$ .

The times for point evaluation and gradient computation are: $O(d^{2}\sum_{i,j\in[m]}|\vec{\sigma}_{ij}|)$ and respectively $O(n^{d}\sum_{i,j\in[m]}|\vec{\sigma}_{ij}|)$ for the $\text{\sf PR}^{d}$ model; $O(r^{2}d^{2}\sum_{i,j\in[m]}|\vec{\sigma}_{ij}|)$ and respectively $O(nr^{3}d^{2}\sum_{i,j\in[m]}|\vec{\sigma}_{ij}|)$ for the $\text{\sf FaMa}^{d}_{r}$ model; and $O(\sum_{i,j\in[m]}|\vec{\sigma}_{ij}|)$ and respectively $O(n\sum_{i,j\in[m]}|\vec{\sigma}_{ij}|)$ for PCA. Recall that the case for PCA is similar to that of LR, or equivalently $\text{\sf PR}_{1}$ .

Overall, there are a couple of remarkable facts regarding the overall runtime of our approach. Without loss of generality, suppose the number of iterations of BGD is bounded. (This bound is typically dimension-free, dependent on the Lipschitz constant of $J$ .) Then, from Proposition 33, there are infinitely many queries for which the overall runtime of BGD is unboundedly better than the output size. First, our approach is faster than even the data-export step of the mainstream approach that uses an external tool to train the model. Second, it is often well-agreed upon that SGD is faster than BGD. However, a single iteration of SGD requires iterating through all data tuples, which takes time at least the output size. In particular, by training the model using BGD in the factorized form, BGD can be unboundedly faster than a single iteration of SGD.

4.5 Diagram of our structure-aware approach revisited

Figure 2 refines Figure 1 and depicts key ideas behind the performance improvements of our structure-aware framework over structure-agnostic learning approaches.

In structure-agnostic learning, a query engine takes the input relations (of size $\leq N$ ) and joins them into a potentially much larger output relation of size $|D|$ , which might in turn get blown up even more inside the machine learning tool. In our structure-aware framework, the input query, the input relations, and function $h$ are first translated into FAQ [8], which is a language that is suitable for aggregate query specification and optimization. In particular, each entry $\vec{\sigma}_{i,j}$ (and $\vec{c}_{j}$ ) of our target matrix $\vec{\Sigma}$ (and vector $\vec{c}$ ) is expressed as the answer to an FAQ query $\varphi^{(i,j)}$ (or $\varphi^{(j)}$ ). All those queries are fed into an FAQ query optimizer. The optimizer factorizes each query $\varphi^{(i,j)}$ into small sub-queries $\varphi^{(i,j)}_{a},\varphi^{(i,j)}_{b},\varphi^{(i,j)}_{c},\ldots$ and solves them individually. Each sub-query results in a relation of size $\leq N^{\text{\sf faqw}}$ , which can be much smaller than the size of the data matrix $D$ . By solving the FAQ queries $\varphi^{(i,j)}$ , we obtain $\vec{\Sigma}$ and $\vec{c}$ , which are all that is needed as input for the convergence step in a batch gradient descent solver.

5 FD-Aware Optimization

In this section, we show how to exploit functional dependencies among variables to reduce the dimensionality of the optimization problem by eliminating functionally determined variables and re-parameterizing the model. We compute the quantities ( $\mathbf{\Sigma}$ , $\mathbf{c}$ ) on the subset of features that are not functionally determined, and then solve the lower-dimensional optimization problem. Finally, we recover the parameters in the original space in closed form. Exploiting functional dependencies drastically reduces the computation time for ( $\mathbf{\Sigma}$ , $\mathbf{c}$ ) and the gradient.

5.1 Introduction to the main ideas

Consider a query $Q$ with categorical variables country and city. For simplicity, assume that there are only two countries “vietnam” and “england”, and $5$ cities “saigon”, “hanoi”, “oxford”, “leeds”, and “bristol”. Under one-hot encoding, the corresponding features are encoded as indicators $x_{\text{\sf vietnam}}$ , $x_{\text{\sf england}}$ , $x_{\text{\sf saigon}}$ , $x_{\text{\sf hanoi}}$ , $x_{\textsf{oxford}}$ , $x_{\textsf{leeds}}$ , $x_{\textsf{bristol}}$ . Since city $\to$ country is a functional dependency (FD), for a given tuple $\mathbf{x}$ in the training dataset, the following hold:

[TABLE]

The first identity states that if a tuple has “vietnam” as the value for country ( $x_{\text{\sf vietnam}}=1$ ), then its value for city can only be either “saigon” or “hanoi”, i.e., $[x_{\text{\sf saigon}},x_{\text{\sf hanoi}}]$ is either $[1,0]$ or $[0,1]$ , respectively. The second identity is explained similarly.

How do we express the identities such as (35) and (36) in a formal manner in terms of the input vectors $\mathbf{x}_{\textsf{city}}$ and $\mathbf{x}_{\textsf{country}}$ ? We can extract in a preprocessing step from the database a relation of the form $R(\textsf{city},\textsf{country})$ with city as primary key. Let $N_{\textsf{city}}$ and $N_{\textsf{country}}$ be the number of cities and countries, respectively. The predicate $R(\textsf{city},\textsf{country})$ is the sparse representation of a matrix $\mathbf{R}$ of size $N_{\textsf{country}}\times N_{\textsf{city}}$ , such that if $\mathbf{x}_{\textsf{city}}$ is an indicator vector for saigon, then $\mathbf{R}\mathbf{x}_{\textsf{city}}$ is an indicator for vietnam. In this language, identities (35) and (36) can written simply as $\mathbf{x}_{\textsf{country}}=\mathbf{R}\mathbf{x}_{\textsf{city}}$ . For example, in the above particular example $N_{\textsf{city}}=5$ , $N_{\textsf{country}}=2$ , and

[TABLE]

This relationship suggests a natural idea: replace any occurrence of statistics $\mathbf{x}_{\textsf{country}}$ by its functionally determining quantity $\mathbf{x}_{\textsf{city}}$ . Since these quantities are present only in the loss function $\mathcal{L}$ via inner products $\langle g(\mathbf{x}),h(\vec{\theta})\rangle$ , such replacements result in a (typically) linear reparameterization of the loss. What happens next is less obvious, due to the presence of the nonlinear penalty function $\Omega$ . Depending on the specific structure of FDs and the choice of $\Omega$ , many parameters associated with redundant statistics, which do not affect the loss $\mathcal{L}$ , can be optimized out directly with respect to the transformed $\Omega$ penalty.

The remainder of this subsection is a gentle introduction of our idea in the presence of one simple FD in the LR model. Consider a query $Q$ in which city and country are two of the categorical features and functionally determine one another via a matrix $\mathbf{R}$ such that $\mathbf{R}\mathbf{x}_{\textsf{city}}=\mathbf{x}_{\textsf{country}}$ for all $\mathbf{x}=(\cdots,\mathbf{x}_{\textsf{city}},\mathbf{x}_{\textsf{country}},\cdots)\in D$ . We exploit this fact to “eliminate” $\mathbf{x}_{\textsf{country}}$ as follows.

[TABLE]

Define a new parameter vector $\vec{\gamma}=(\vec{\gamma}_{j})_{j\in V-\{\textsf{country}\}}$ (note that there is no $\vec{\gamma}_{\textsf{country}}$ ), and two functions $\overline{g}:\mathbb{R}^{n-1}\to\mathbb{R}^{n-1}$ , $\overline{h}:\mathbb{R}^{n}\to\mathbb{R}^{n-1}$ :

[TABLE]

Then, we can reparameterize $J(\vec{\theta})$ in terms of $\vec{\gamma}$ by

[TABLE]

Note how $\vec{\theta}_{\textsf{country}}$ has disappeared from the loss term, but it still remains in the penalty term. We now “optimize out” $\vec{\theta}_{\textsf{country}}$ by observing that

[TABLE]

By setting (41) to [math] we obtain $\vec{\theta}_{\textsf{country}}$ in terms of $\vec{\gamma}_{\textsf{city}}$ :

[TABLE]

where $\mathbf{I}_{\textsf{country}}$ is the order- $N_{\textsf{country}}$ identity matrix and similarly for $\mathbf{I}_{\textsf{city}}$ . We can thus express $J$ and its gradient completely in terms of $\vec{\gamma}$ :

[TABLE]

(Appendix E.1 contains formal proofs of the above claims.) The gradient of the loss term is computed using the matrix $\overline{\vec{\Sigma}}$ and the vector $\overline{\mathbf{c}}$ with respect to the pair $(\overline{g},\overline{h})$ of reduced dimensionality. The matrix $(\mathbf{I}_{\textsf{city}}+\mathbf{R}^{\top}\mathbf{R})$ is a rank- $N_{\textsf{country}}$ update to the identity matrix $\mathbf{I}_{\textsf{city}}$ , strictly positive definite and thus invertible. The inverse can be obtained using database aggregate queries; for numerical stability, one may compute its Cholesky decomposition which can also be expressed by aggregate queries. These “linear algebra via aggregate queries” computations are possible because our matrices admit a database interpretation, cf.Section 5.6.

5.2 Functional dependencies (FDs)

Composite FDs lead to more complex identities. For instance, the FD (guest, hotel, date) $\to$ room leads to the identity $x_{\textsf{room}}=\sum x_{\textsf{guest}}x_{\textsf{hotel}}x_{\textsf{date}}$ . Let $R$ be a relation on attributes guest, hotel, date, and room, encoding this dependency, i.e., $R$ has a compound key $(\textsf{guest},\textsf{hotel},\textsf{date})$ . Then, corresponding to $R$ there is a matrix $\mathbf{R}$ of dimension $N_{\textsf{room}}\times N_{\textsf{guest}}\cdot N_{\textsf{hotel}}\cdot N_{\textsf{date}}$ for which $\mathbf{x}_{\textsf{room}}=\mathbf{R}(\mathbf{x}_{\textsf{guest}}\otimes\mathbf{x}_{\textsf{hotel}}\otimes\mathbf{x}_{\textsf{date}})$ . Our results can be extended to the case of composite FDs, yet with a great notational burden; for the sake of clarity, we only state the results for simple FDs.

Definition 4.

An FD is simple if its left-hand side is one variable.

Let a query $Q$ in which there are $k$ disjoint groups $G_{1},\dots,G_{k}$ of features, among other features. The $i$ th group is $G_{i}=\{f_{i}\}\cup S_{i}$ , where $f_{i}$ is a feature, $S_{i}$ a set of features, and $f_{i}\to S_{i}$ is an FD. We shall refer to these as groups of simple FDs.

Example 15.

In a typical feature extraction query for retailer customers, we have $k=3$ groups (in addition to other features): the first group contains week $\to$ month $\to$ quarter $\to$ year, and thus $f_{1}$ = week and $S_{1}$ = $\{$ month, quarter, year $\}$ . In the second group, $f_{2}$ = sku and $S_{2}=\{$ type, color, size, … $\}$ (a rather large group). In the third group $f_{3}=$ store and $S_{3}=\{$ city, country, region $\}$ . ∎

For each feature $c\in S_{i}$ , let $\mathbf{R}_{c}$ denote the matrix for which $\mathbf{x}_{c}=\mathbf{R}_{c}\mathbf{x}_{f_{i}}$ . For the sake of brevity, we also define a matrix $\mathbf{R}_{f_{i}}=\mathbf{I}_{f_{i}}$ (the identity matrix of dimension equal to the active domain size of attribute $f_{i}$ ), so the equality $\mathbf{R}_{c}\mathbf{x}_{f_{i}}=\mathbf{x}_{c}$ holds for every $c\in G_{i}$ .

The linear relationship holds even if the variables are not categorical. For example, consider the FD sku $\to$ price (assuming every stock-keeping unit has a fixed sale-price). The relationship is modeled with a $1\times N_{\sf sku}$ matrix $\mathbf{R}$ , where the entry corresponding to a sku is its price. Then, $\mathbf{R}\mathbf{x}_{\textsf{sku}}=x_{\textsf{price}}$ for any indicator vector $\mathbf{x}_{\textsf{sku}}$ .

Definition 5 (FD-reduced pairs of functions).

Given a pair of functions $g$ and $h$ in our problem setting. Recall that $C_{j}$ ’s are defined in Section 3.3, while $S_{k}$ ’s are given in Definition 4. Define

[TABLE]

( $K$ is the set of component functions of $h$ containing at least one functionally determined variable.)

The group of simple FDs induces an FD-reduced pair of functions $\overline{g}:\mathbb{R}^{p-|K|}\to\mathbb{R}^{m-|K|}$ and $\overline{h}:\mathbb{R}^{n}\to\mathbb{R}^{m-|K|}$ as follows: The component functions of $\overline{h}$ are obtained from the component functions of $h$ by removing all component functions $h_{j}$ for $j\in K$ . Similarly, $\overline{g}$ is obtained from $g$ by removing all component functions $g_{j}$ for which $j\in K$ . Naturally, define the covariance matrix $\overline{\vec{\Sigma}}$ and the correlation vector $\overline{\mathbf{c}}$ as in (13) and (14), but with respect to $\overline{h}$ .

We next generalize the above technique to speedup the training of $\text{\sf PR}^{d}$ and FaMa under an arbitrary collection of simple FDs.

5.3 Polynomial regression under FDs

Recall the $\text{\sf PR}^{d}$ -model formulated in Example 3. Consider the set $A_{V}$ of all tuples $\mathbf{a}_{V}=(a_{w})_{w\in V}\in\mathbb{N}^{V}$ of non-negative integers such that $\left\|\mathbf{a}_{V}\right\|_{1}\leq d$ . For any $(\mathbf{x},y)\in D$ and $\mathbf{a}\in A_{V}$ , define $\mathbf{x}^{\otimes\mathbf{a}}=\bigotimes_{v\in V}\mathbf{x}_{v}^{\otimes a_{v}}.$ In the $\text{\sf PR}^{d}$ model we have $\vec{\theta}=(\vec{\theta}_{\mathbf{a}})_{\left\|\mathbf{a}\right\|_{1}\leq d}$ , $g(\vec{\theta})=\vec{\theta}$ , and $h_{\mathbf{a}}(\mathbf{x})=\mathbf{x}^{\otimes\mathbf{a}}$ . If a feature, say $v\in V$ , is non-categorical, then $\mathbf{x}_{v}^{\otimes a_{v}}=x_{v}^{a_{v}}$ . If we knew $x_{v}\in\{0,1\}$ , then $x_{v}^{a_{v}}=x_{v}$ and thus there is no need to have terms for which $a_{v}>1$ . A similar situation occurs when $v$ is a categorical variable. To see this, let us consider a simple query where $V=\{b,c,w,t\}$ , and all four variables are categorical. Suppose the $\text{\sf PR}^{d}$ model has a term corresponding to $\mathbf{a}=(a_{b},a_{c},a_{w},a_{t})=(0,2,0,1)$ . The term of $\left\langle\vec{\theta},h(\mathbf{x})\right\rangle$ indexed by tuple $\mathbf{a}$ is of the form

[TABLE]

For the dimensionality to match up, $\vec{\theta}_{\mathbf{a}}$ is a 3rd-order tensor, say indexed by $(i,j,k)$ . The above expression can be simplified as

[TABLE]

where the equality holds due to the fact that $\mathbf{x}_{c}(j)$ is idempotent. In particular, we only need the entries indexed by $(j,j,k)$ of $\vec{\theta}_{\mathbf{a}}$ . Equivalently, we write:

[TABLE]

Multiplying on the left by the matrix $(\mathbf{I}_{c}\star\mathbf{I}_{c})^{\top}\otimes\mathbf{I}_{t}$ has precisely the same effect as selecting out only entries $\vec{\theta}_{\mathbf{a}}(j,j,k)$ from the tensor $\vec{\theta}_{\mathbf{a}}$ . More generally, in the $\text{\sf PR}^{d}$ model we can assume that all the indices $\mathbf{a}_{V}=(a_{v})_{v\in V}$ satisfy the condition that $a_{v}\in\{0,1\}$ whenever $v$ is categorical. (This is in addition to the degree requirement that $\left\|\mathbf{a}_{V}\right\|_{1}\leq d$ .)

Given $k$ groups of FDs represented by $G_{1},\dots,G_{k}$ , let $G=\bigcup_{i=1}^{k}G_{i}$ , $S=\bigcup_{i=1}^{k}S_{i}$ , $\overline{G}=V-G$ , $\overline{S}=V-S$ , and $F=\{f_{1},\dots,f_{k}\}$ . For every non-empty subset $T\subseteq[k]$ , define $F_{T}=\{f_{i}\ |\ i\in T\}$ . Given a natural number $q<d$ , and a non-empty set $T\subseteq[k]$ with size $|T|\leq d-q$ , define the collection

[TABLE]

For every tuple $\mathbf{a}_{\overline{G}}\in\mathbb{N}^{\overline{G}}$ with $\left\|\mathbf{a}_{\overline{G}}\right\|_{1}=q<d$ and every $U\in\mathcal{U}(T,q)$ , define the following matrices, which play the same role as $\mathbf{I}_{\textsf{city}}+\mathbf{R}^{\top}\mathbf{R}$ in Section 5.1:

[TABLE]

Note that the matrices $\mathbf{B}_{\mathbf{a}_{\overline{G}},T}$ can be further factorized as each of its terms is a tensor product, but we refrain from doing so here to avoid heavy notational complexity in the proofs and the theorem statement. See (96) and (97) in the appendix for examples of what we mean by factorization of these matrices. The following theorem reparameterizes $J(\vec{\theta})$ for $\text{\sf PR}^{d}$ ( $d\geq 1$ ) to become $\overline{J}(\vec{\gamma})$ . While $\vec{\theta}=(\vec{\theta}_{\mathbf{a}})$ is a vector indexed by tuples $\mathbf{a}=\mathbf{a}_{V}\in\mathbb{N}^{V}$ , the new parameters $\vec{\gamma}=(\vec{\gamma}_{\mathbf{b}})$ are indexed by integer tuples $\mathbf{b}=\mathbf{b}_{\bar{S}}\in\mathbb{N}^{\bar{S}}$ .

Theorem 5.1.

Let the $\text{\sf PR}^{d}$ -model with parameters $\vec{\theta}=(\vec{\theta}_{\mathbf{a}_{V}})_{\left\|\mathbf{a}_{V}\right\|_{1}\leq d}$ , and $k$ groups of simple FDs $G_{i}=\{f_{i}\}\cup S_{i}$ , $i\in[k]$ . Define the reparameterization:

[TABLE]

Then, minimizing $J(\vec{\theta})$ is equivalent to minimizing the function

[TABLE]

where

[TABLE]

(Recall $\overline{\vec{\Sigma}}$ and $\overline{c}$ from Definition 5.)

The proof of this theorem (Appendix E.2) is technically involved. $\overline{J}$ is defined above with respect to the FD-reduced pair of functions $\overline{g},\overline{h}$ and a reduced parameter space of $\vec{\gamma}$ . Its gradient is simple to compute, since

[TABLE]

Moreover, once a minimizer $\vec{\gamma}$ of $\overline{J}$ is obtained, we can compute a minimizer $\vec{\theta}$ of $J$ by setting

[TABLE]

Theorem 5.1 might be a bit difficult to grasp at first glance due to its generality. To give the reader a sense of how the theorem is applied in specific instances, Appendix E.4 and E.5 present two specializations of the theorem for (ridge) linear regression ( $\text{\sf PR}^{1}$ ), and degree-2 polynomial regression ( $\text{\sf PR}^{2}$ ).

5.4 Factorization machines under FDs

We now turn our attention to $\text{\sf FaMa}^{2}_{r}$ .

Theorem 5.2.

Consider the FaMa model of degree $2$ , rank $r$ , parameters $\vec{\theta}=(\vec{\theta}_{i},(\vec{\theta}_{i}^{(\ell)})_{\ell\in[r]})_{i\in V}$ and $k$ groups of simple FDs $G_{i}=\{f_{i}\}\cup S_{i}$ , $i\in[k]$ . Let $G=\cup_{i\in[k]}G_{i}$ ,

[TABLE]

and the following reparameterization:

[TABLE]

Then, minimizing $J(\vec{\theta})$ is equivalent to minimizing the function $\overline{J}(\vec{\gamma})=\frac{1}{2}\overline{g}(\vec{\gamma})^{\top}\overline{\vec{\Sigma}}\overline{g}(\vec{\gamma})-\left\langle\overline{g}(\vec{\gamma}),\overline{\mathbf{c}}\right\rangle+\frac{\lambda}{2}\Omega(\vec{\gamma}),$ where

[TABLE]

(Recall $\overline{g}$ , $\overline{\vec{\Sigma}}$ and $\overline{c}$ from Definition 5.)

In order to optimize $\overline{J}$ with respect to $\vec{\gamma}$ , the following proposition provides a closed form formulae for the relevant gradient.

Proposition 5.3.

The gradient of $\Omega(\vec{\gamma})$ defined in (53) can be computed using $\vec{\delta}^{(\ell)}_{i}=\sum_{c\in S_{i}}\mathbf{R}_{c}^{\top}\vec{\gamma}^{(\ell)}_{c}$ , and

[TABLE]

Then,

[TABLE]

Suppose that the minimizer $\vec{\gamma}$ of $\overline{J}$ has been obtained, then a minimizer $\vec{\theta}$ of $J$ is available in closed form:

[TABLE]

This section shows that our technique applies to a non-linear model too. It should be obvious that a similar reparameterization works for $\text{\sf FaMa}^{d}_{r}$ for any $d\geq 1$ . There is some asymmetry in the reparameterization of $1$ st-order parameters $\vec{\theta}_{i}$ and $2$ nd-order parameters $\vec{\theta}^{(\ell)}_{i}$ in Theorem 5.2, because we can solve a system of linear equation with matrix inverses, but we don’t have closed form solutions for quadratic equations.

5.5 Principal Component Analysis under FDs

In this section, we show how to exploit functional dependencies to reduce the number of dimensions of the input to PCA by computing the top- $K$ eigenvectors and eigenvalues over the lower dimensional covariance matrix without the functionally determined features. We show that the eigenvalues of the lower dimensional problem are identical to those of the original problem, while the original eigenvectors can be derived from the solution to the lower dimensional problem.

Recall the functional dependencies of the form $f_{i}\to S_{i}$ , the sets $S$ of functionally determined variables and $\overline{V}=V-S$ of all other variables, as in Section 5.2. Also, recall that each $\mathbf{x}\in D$ is an $n$ -dimensional vector, and for each categorical variable $c$ , the component $\mathbf{x}_{c}$ is an indicator vector.

We define $\overline{\mathbf{x}}$ to be the vector of size $q=|\overline{V}|$ , which is obtained by removing all components from $\mathbf{x}$ that correspond to functionally determined variables (i.e., all $\mathbf{x}_{c}$ for which $c\in S$ ). Similar to (30) and (32), we can express the $q$ -dimensional vector of means and the $q\times q$ covariance matrix over $\overline{\mathbf{x}}$ :

[TABLE]

The covariance matrix $\overline{\vec{\Sigma}}_{1}$ can be computed directly over the input database as in Example 14. The effect of computing a covariance matrix over $\overline{\mathbf{x}}$ instead of $\mathbf{x}$ is that its sparse tensor representation does not require the computation of any aggregate over a functionally determined variable.

For each functionally determined variable $c\in S_{i}$ , the FD $f_{i}\to c$ induces a mapping from a component in $\overline{\mathbf{x}}$ to a component in $\mathbf{x}$ . We define $\mathbf{U}$ to be the rank- $q$ matrix of all such mappings, so that $\mathbf{x}=\mathbf{U}\overline{\mathbf{x}}$ and each index $\mathbf{u}_{kl}\in\mathbf{U}$ maps $\overline{\mathbf{x}}_{l}$ to $\mathbf{x}_{k}$ . For a variable $c_{k}$ , let $N_{k}$ be its domain size if $c_{k}$ is categorical or one otherwise. Take two such variables $c_{k}$ and $c_{l}$ . Then, if $c_{k}=c_{l}$ the entry $\mathbf{u}_{kl}$ is the identity matrix $I_{N_{k}}$ . If $c_{k}\neq c_{l}$ and there is no functional dependency between them, then the entry $\mathbf{u}_{kl}$ is the $N_{k}\times N_{l}$ matrix of zeros. In case there is a functional dependency $c_{l}\to c_{k}$ , the entry $\mathbf{u}_{kl}$ is the $N_{k}\times N_{l}$ matrix that encodes this functional dependency. For instance, in case $l=\textsf{city}$ and $k=\textsf{country}$ , the entry $\mathbf{u}_{kl}$ is the $N_{\sf country}\times N_{\sf city}$ matrix whose entries $(m,n)$ are one if the $n$ -th city is located in the $m$ -th country, or zero otherwise (as exemplified in (37) in Section 5.1, and generalized as $\mathbf{R}_{c}$ matrices in Section 5.2). We can compute a sparse representation of the matrix $\mathbf{U}$ as a collection of group-by queries over the input relations, and without materializing the result of the feature extraction query.

The following lemma shows that the eigenvalues of the covariance matrix are preserved under FDs while the eigenvectors are subject to a simple transformation.

Lemma 5.4.

For some $K\leq q$ , let $\lambda_{1},\ldots,\lambda_{K}>0$ be the top- $K$ (positive-valued) eigenvalues of $q\times q$ matrix $\mathbf{U}^{\top}\mathbf{U}\overline{\vec{\Sigma}}$ and $\vec{\eta}_{1},\ldots,\vec{\eta}_{K}\in\mathbb{R}^{q}$ be the corresponding eigenvectors. Then $\lambda_{1},\ldots,\lambda_{K}$ are also the top- $K$ eigenvalues of $\Sigma_{1}$ . Moreover, the eigenvectors of $\Sigma_{1}$ are

[TABLE]

Proof.

First note that $\vec{\mu}=\frac{1}{|D|}\sum_{\mathbf{x}\in D}\mathbf{x}=\frac{1}{|D|}\sum_{\mathbf{x}\in D}\mathbf{U}\overline{\mathbf{x}}=\mathbf{U}\overline{\vec{\mu}}$ and $\vec{\Sigma}_{1}=\frac{1}{|D|}\sum_{\mathbf{x}\in D}\mathbf{x}\mathbf{x}^{\top}-\vec{\mu}\vec{\mu}^{\top}=\frac{1}{|D|}\sum_{\mathbf{x}\in D}\mathbf{U}\overline{\mathbf{x}}\,\overline{\mathbf{x}}^{\top}\mathbf{U}^{\top}-\mathbf{U}\overline{\vec{\mu}}\,\overline{\vec{\mu}}^{\top}\mathbf{U}^{\top}=\mathbf{U}(\frac{1}{|D|}\sum_{\mathbf{x}\in D}\overline{\mathbf{x}}\,\overline{\mathbf{x}}^{\top}-\overline{\vec{\mu}}\,\overline{\vec{\mu}}^{\top})\mathbf{U}^{\top}=\mathbf{U}\overline{\vec{\Sigma}}_{1}\mathbf{U}^{\top}$ . For any eigen-pair $(\lambda,\vec{\theta})$ of $\vec{\Sigma}_{1}$ , it holds $\vec{\Sigma}_{1}\vec{\theta}=\lambda\vec{\theta}$ by definition. Thus, $\mathbf{U}\overline{\vec{\Sigma}}_{1}\mathbf{U}^{\top}\vec{\theta}=\lambda\vec{\theta}$ . Multiplying both sides by $\mathbf{U}^{\top}$ to the left, we obtain $\mathbf{U}^{\top}\mathbf{U}\overline{\vec{\Sigma}}_{1}\mathbf{U}^{\top}\vec{\theta}=\lambda\mathbf{U}^{\top}\vec{\theta}$ . Hence, $(\lambda,\mathbf{U}^{\top}\vec{\theta})$ is an eigen-pair of $\mathbf{U}^{\top}\mathbf{U}\overline{\vec{\Sigma}}_{1}$ . Since $\mathbf{U}^{\top}\mathbf{U}$ is full-ranked, the set of positive eigenvalues of $\vec{\Sigma}_{1}$ is identical to that of $\mathbf{U}^{\top}\mathbf{U}\overline{\vec{\Sigma}}_{1}$ . Moreover, let $(\lambda,\vec{\eta})$ be any of the eigen-pairs of $\mathbf{U}^{\top}\mathbf{U}\overline{\vec{\Sigma}}_{1}$ in which $\lambda>0$ , then the corresponding eigenvector of $\vec{\Sigma}_{1}$ may be obtained by the identity: $\mathbf{U}\overline{\vec{\Sigma}}_{1}\mathbf{U}^{\top}\vec{\theta}=\lambda\vec{\theta}$ , which yields $\mathbf{U}\overline{\vec{\Sigma}}_{1}\vec{\eta}=\lambda\vec{\theta}$ . This gives $\vec{\theta}=(1/\lambda)\mathbf{U}\overline{\vec{\Sigma}}_{1}\vec{\eta}$ to conclude the proof. ∎

5.6 Linear algebra with database queries

To apply the above results, we need to solve several computational primitives. The first primitive is to compute the matrix inverse $\mathbf{B}_{T,q}^{-1}$ and its product with another vector. This task can be done by either explicitly computing the inverse, or computing the Cholesky decomposition of the matrix $\mathbf{B}_{T,q}$ . We next explain how both of these tasks can be done using database queries.

Maintaining the matrix inverse with rank- $1$ updates

Using Sherman-Morrison-Woodbury formula [35], we can incrementally compute the inverse of the matrix $\mathbf{I}+\sum_{c\in G_{i}}\mathbf{R}_{c}^{\top}\mathbf{R}_{c}$ as follows. Let $S\subset G_{i}$ be some subset and suppose we have already computed the inverse for $\mathbf{M}_{S}=\mathbf{I}+\sum_{s\in S}\mathbf{R}_{s}^{\top}\mathbf{R}_{s}$ . We now explain how to compute the inverse for $\mathbf{M}_{S\cup\{c\}}=\mathbf{I}+\sum_{s\in S\cup\{c\}}\mathbf{R}_{s}^{\top}\mathbf{R}_{s}$ . For concreteness, let the matrix $\mathbf{R}_{c}$ map city to country. For each country country, let $\mathbf{e}_{\textsf{country}}$ denote the $01$ -vector where there is a $1$ for each city the country has. For example, $\mathbf{e}_{\text{cuba}}=[1\ 1\ 0\ 0\ 0]^{\top}.$ Then, $\mathbf{R}_{c}^{\top}\mathbf{R}_{c}=\sum_{\textsf{country}}\mathbf{e}_{\textsf{country}}\mathbf{e}^{\top}_{\textsf{country}}.$ And thus, starting with $\mathbf{M}_{S}$ , we apply the Sherman-Morrison-Woodbury formula for each country, such as:

[TABLE]

This update can be done with database aggregate queries, because $\mathbf{e}_{\textsf{cuba}}^{\top}\mathbf{M}^{-1}\mathbf{e}_{\textsf{cuba}}$ is a sum of entries $(i,j)$ in $\mathbf{M}^{-1}$ where both $i$ and $j$ are cities in cuba; $\mathbf{v}=\mathbf{M}^{-1}\mathbf{e}_{\textsf{cuba}}$ is the sum of columns of $\mathbf{M}^{-1}$ corresponding to cuba; and $\mathbf{M}^{-1}\mathbf{e}_{\textsf{cuba}}\mathbf{e}_{\textsf{cuba}}^{\top}\mathbf{M}^{-1}$ is exactly $\mathbf{v}\mathbf{v}^{\top}$ .

Overall, each update (56) can be done in $O(N_{\textsf{city}}^{2})$ -time, for an overall runtime of $O(N_{\textsf{city}}^{2}N_{\textsf{country}})$ . This runtime should be contrasted with Gaussian-elimination-based inverse computation time, which is $O(N_{\textsf{city}}^{3})$ . When the FDs form a chain, the blocks are nested inside one another, and thus each update is even cheaper as we do not have to access all $N_{\textsf{city}}^{2}$ entries.

Maintaining a Cholesky decomposition with rank- $k$ update

Maintaining a matrix inverse can be numerically unstable. It would be best to compute a Cholesky decomposition of the matrix, since this strategy is numerically more stable. There are known rank- $1$ update algorithms [30, 23], using strategies similar to the inverse rank- $1$ update above. A further common computational primitive is to multiply a tensor product with a vector, such as in $(\mathbf{B}_{i}^{-1}\otimes\mathbf{B}_{j}^{-1})\vec{\gamma}_{f_{i}f_{j}}$ (also expressible as aggregate queries).

5.7 Discussion

Diagram of our structure-aware learning in the presence of FDs

Figure 3 depicts the enhancements that we introduce to our framework in order to take advantage of FDs in the input database instance and reduce our previous runtime even further (but still compute the same $\vec{\theta}^{*}$ as before).

As explained earlier, computing each entry of the matrix $\vec{\Sigma}$ and of the vector $\vec{c}$ requires solving an FAQ query. However, by utilizing FDs we can filter out many of those entries as unneeded for later stages, thus significantly reducing the number of FAQ queries that we have to solve. After the filtering process, $\vec{\Sigma}$ and $\vec{c}$ shrink down to $\overline{\vec{\Sigma}}$ and $\overline{\vec{c}}$ , which we compute and feed to gradient-descent (GD). We also filter the function $g$ down to $\bar{g}$ and feed the latter to GD. Now, we run GD in the space of $\vec{\gamma}$ (instead of the original higher-dimensional space of $\vec{\theta}$ ). During each iteration of GD, in order to compute the objective function $\overline{J}(\vec{\gamma})$ and its gradient $\mbox{\boldmath$ \nabla $}\overline{J}(\vec{\gamma})$ , we need to use the matrices $\mathbf{R}$ that represent the functional dependencies. And after GD finishes, we have to convert the resulting optimal solution $\vec{\gamma}^{*}$ back into the original space to get $\vec{\theta}^{*}$ . Such conversion also requires the FD-matrices $\mathbf{R}$ .

Impact of FDs on model complexity

The prevalence of FDs presents new challenges from both computational and statistical viewpoints. On the one hand, a reasonable and well-worn rule of thumb in statistics dictates that one should always eliminate features that are functionally dependent on others, because this helps reduce both computation and model’s complexity, which in turn leads to reduced generalization error (as also noted in [44]). On the other hand, the statistical effectiveness of such a rule is difficult to gauge when the nature of dependence goes beyond linearity. In such scenarios, it might be desirable to keep some redundant variables, but only if they help construct simpler forms of regression/classification functions, leading to improved approximation ability for the model class.

It is, however, difficult to know a priori which redundant features lead to simple functions. Therefore, the problem of dimensionality reduction cannot be divorced from the model class under consideration. While this remains unsolved in general, in this work we restricted ourselves to specific classes of learning models, the complexity of which may still be varied through regularization via (non-linear) penalties. Within a regularized parametric model class, we introduced dimensionality reduction techniques (variable elimination and re-parameterization) that may not fundamentally change the model’s capacity. The reduction in the number of parameters may still help reduce the variance of parameter estimates, leading to improved generalization error guarantees.

Impact of FDs on computational complexity

Model reparameterization under FDs does not lower the data complexity from Proposition 4.4 for the computation of the sparse tensor representation. Under a simple FD $A\rightarrow B$ , the number of categories of the functionally determined categorical variable $B$ cannot exceed that of the functionally determining categorical variable $A$ . This means that by avoiding the computation of aggregates involving $B$ , the data complexity for the computation of the sparse tensor representation with both $A$ and $B$ is the same as with $A$ only.

Computing less aggregates means however a reduction in the query complexity. In case only $q<n$ variables functionally determine the entire set of $n$ variables, the dimensionality of $\overline{\vec{\Sigma}}$ for $\text{\sf PR}^{d}$ is $\Theta(q^{d})\times\Theta(q^{d})$ , which is much smaller than the dimensionality $\Theta(n^{d})\times\Theta(n^{d})$ of $\vec{\Sigma}$ . This reduction can be significant: In one of our experiments with $\text{\sf PR}_{2}$ on the Retailer dataset v4, there is a reduction from 46M to 36M entries in the sparse tensor representation of $\vec{\Sigma}$ and $\vec{c}$ . Proposition E.1 in Appendix E.3 provides the corresponding version of Corollary 21 with respect to $\overline{\vec{\Sigma}}$ .

This reduction in the query complexity comes at a price: The gradient solver has a new data-dependent computation in the regularizer. For instance, under the functional dependency $\textsf{city}\rightarrow\textsf{country}$ used in Section 5.1, $\vec{\theta}_{\textsf{country}}={(\mathbf{I}_{\textsf{country}}+\mathbf{R}\mathbf{R}^{\top})^{-1}\mathbf{R}}\vec{\gamma}_{\textsf{city}}$ where $\mathbf{R}$ is a matrix that maps between cities and countries in the input database. Computing this linear algebra expression takes time $O(N_{\textsf{city}}^{2}N_{\textsf{country}})$ as explained in Section 5.6, where $N_{\textsf{city}}$ and $N_{\textsf{country}}$ are the number of cities (categories for the city categorical feature) and respectively countries. Assuming these quantities are small, the reduction in the number of aggregates vastly dominates the modest increase in the complexity of the additional linear algebra expression. Figure 7 in Section 7 indeed shows that using one single functional dependency for Retailer leads to a $3.5\times$ performance speedup.

6 The Design and Implementation of AC/DC

In this section, we present the design of AC/DC, which is our implementation of the algorithms and optimizations for the end-to-end computation of square loss problems presented in the previous sections. AC/DC computes each entry in the sparse tensor representation of the problem as an aggregate over the feature extraction join query, following the SQL encoding developed in previous sections, e.g., Examples 11 and 13. Two key optimizations used by AC/DC for the computation of these aggregates are: (1) Factorized computation of aggregates over the feature extraction query, with low complexity (Section 6.1); and (2) Massively shared computation across the aggregates (Section 6.2). AC/DC also exploits functional dependencies to reduce the dimensionality of the problem. By design, AC/DC does not achieve the complexity bound from Proposition 4.4. This is because it would need different query plans for different subsets of the aggregates. Instead, it uses one query plan for all these aggregates. This increases the opportunity to share computation across the aggregates, which proved much more beneficial for the overall performance.

6.1 Factorized aggregate computation

Factorized aggregate computation relies on a variable order for the query $Q$ to avoid redundant computation. In this paper, we assume that we are given a variable order. Prior work discusses the query optimization problem of finding good orders [13, 8].

Variable Orders. State-of-the-art query evaluation uses relation-at-a-time query plans. We use variable-at-a-time query plans, which we call variable orders. These are partial orders on the variables in the query, capture the join dependencies in the query, and dictate the order in which we solve each join variable. For each variable, we join all relations with that variable. Our choice is motivated by the complexity of join evaluation: Relation-at-a-time query plans are provably suboptimal, whereas variable-at-a-time query plans can be chosen to be optimal [53].

For a query $Q$ , a variable $X$ depends on a variable $Y$ if both are in the schema of a relation in $Q$ .

Definition 1 (adapted from [56]).

A variable order $\Delta$ for a join query $Q$ is a pair $(F,dep)$ , where $F$ is a rooted forest with one node per variable in $Q$ , and $dep$ is a function mapping each variable $X$ to a set of variables in $F$ . It satisfies the following constraints:

•

For each relation in $Q$ , its variables lie along the same root-to-leaf path in $F$ .

•

For each variable $X$ , $dep(X)$ is the subset of its ancestors in $F$ on which the variables in the subtree rooted at $X$ depend.

Without loss of generality, we use variables orders that are trees instead of forests. We can convert a forest into a tree by adding to each relation the same dummy join variable that takes a single value. For a variable $X$ in the variable order $\Delta$ , $anc(X)$ is the set of all ancestor variables of $X$ in $\Delta$ . The set of variables in $\Delta$ (schema of a relation $R$ ) is denoted by $vars(\Delta)$ ( $vars(R)$ respectively) and the variable at the root of $\Delta$ is denoted by $root(\Delta)$ .

Example 16.

Figure 6(a) shows a variable order for the natural join of relations $R(A,B,C)$ , $T(B,D)$ , and $S(A,E)$ . Then, $anc(D)=\{A,B\}$ and $dep(D)=\{B\}$ , i.e., $D$ has ancestors $A$ and $B$ , yet it only depends on $B$ . Given $B$ , the variables $C$ and $D$ are independent of each other. For queries with group-by variables, we choose a variable order where these variables sit above the other variables [13]. ∎

Figure 4 presents the AC/DC algorithm for factorized computation of SQL aggregates over the feature extraction query $Q$ . The backbone of the algorithm without the code in boxes explores the factorized join of the input relations $R_{1},\ldots,R_{d}$ over a variable order $\Delta$ of $Q$ . As it traverses $\Delta$ in depth-first preorder, it assigns values to the query variables. The assignments are kept in varMap and used to compute aggregates by the code in boxes.

The relations are sorted following a depth-first pre-order traversal of $\Delta$ . Each call takes a range $[x_{i},y_{i}]$ of tuples in each relation $R_{i}$ . Initially, these ranges span the entire relations. Once the root variable $A$ in $\Delta$ is assigned a value $a$ from the intersection of possible $A$ -values from the input relations, these ranges are narrowed down to those tuples with value $a$ for $A$ .

To compute an aggregate over the variable order $\Delta$ rooted at $A$ , we first initialize the aggregate to zeros. This is needed since the aggregates might have been used earlier for different assignments of ancestor variables in $\Delta$ . We next check whether we previously computed the aggregate for the same assignments of variables in $dep(A)$ , denoted by context, and cached it in a map $\text{cache}_{A}$ . Caching is useful when $dep(A)$ is strictly contained in $anc(A)$ , since this means that the aggregate computed at $A$ does not need to be recomputed for distinct assignments of variables in $anc(A)\setminus dep(A)$ . In this case, we probe the cache using as key the assignments in varMap of the $dep(A)$ variables: $\text{cache}_{A}[\text{context}]$ . If we have already computed the aggregates over that assignment for $dep(A)$ , then we can just reuse the previously computed aggregates and avoid recomputation.

If $A$ is a group-by variable, then we compute a map from each $A$ -value $a$ to a function of $a$ and aggregates computed at children of $A$ , if any. If $A$ is not a group-by variable, then we compute a map from the empty value $()$ to such a function; in this latter case, we could have just computed the aggregate instead of the map though we use the map for uniformity. In case there are group-by variables under $A$ , the computation at $A$ returns maps whose keys are tuples over all these group-by variables in $vars(\Delta)$ .

Example 17.

Consider a feature extraction query $Q$ with the variable order $\Delta$ in Figure 6(a). We first compute the assignments for $A$ as $Q_{A}=\pi_{A}R\bowtie\pi_{A}T$ . For each assignment $a\in Q_{A}$ , we then find assignments for variables under $A$ within the narrow ranges of tuples that contain $a$ . The assignments for $B$ in the context of $a$ are given by $Q^{a}_{B}=\pi_{B}(\sigma_{A=a}R)\bowtie\pi_{B}S$ . For each $b\in Q^{a}_{B}$ , the assignments for $C$ and $D$ are given by $Q^{a,b}_{C}=\pi_{C}(\sigma_{A=a\land B=b}R)$ and $Q^{b}_{D}=\pi_{D}(\sigma_{B=b}S)$ . Since $D$ depends on $B$ and not on $A$ , the assignments for $D$ under a given $b$ are repeated for every occurrence of $b$ with assignments for $A$ . The assignments for $E$ given $a\in Q_{A}$ are computed as $Q^{a}_{E}=\pi_{E}(\sigma_{A=a}T)$ .

Consider the aggregate $\texttt{COUNT}(Q)$ . The count at each variable $X$ is computed as the sum over all value assignments of $X$ of the product of the counts at the children of $X$ in $\Delta$ ; if $X$ is a leaf in $\Delta$ , the product at children is considered 1. For our variable order, this computation is captured by the following factorized expression:

[TABLE]

where $V_{D}(b)=\sum_{d\in Q^{b}_{D}}\!\!1$ is cached the first time we encounter the assignment $b$ for $B$ and reused for all subsequent occurrences of this assignment under assignments for $A$ .

Summing all $X$ -values in the result of $Q$ for a variable $X$ is done similarly, with the difference that at the variable $X$ in $\Delta$ we compute the sum of the values of $X$ weighted by the product of the counts of their children. For instance, the aggregate $\texttt{SUM}(C*E)$ is computed over our variable order by the following factorized expression:

[TABLE]

To compute the aggregate $\texttt{SUM}(C*E)\texttt{ GROUP BY }A$ , we compute $\texttt{SUM}(C*E)$ for each assignment for $A$ instead of marginalizing away $A$ . The result is a map from $A$ -values to values of $\texttt{SUM}(C*E)$ . ∎

A good variable order may include variables that are not explicitly used in the optimization problem. This is the case of join variables whose presence in the variable order ensures a good factorization. For instance, if we remove the variable $B$ from the variable order in Figure 6(a), the variables $C,D$ are no longer independent and we cannot factorize the computation over $C$ and $D$ . AC/DC exploits the conditional independence enabled by $B$ , but computes no aggregate over $B$ if this is not required in the problem.

The complexity bound in Proposition 4.4 is achieved by factorizing the computation of each aggregate in $\vec{\Sigma}$ over a variable order that has all group-by variables for this aggregate above all other variables. Thus, different aggregates can be computed over different variable orders.

6.2 Shared computation of aggregates

Section 6.1 explains how to factorize the computation of one aggregate in $\vec{\Sigma}$ , $\vec{c}$ , and $s_{Y}$ over the join of database relations. In this section we show how to share the computation across aggregates.

Example 18.

We consider the factorized expression of the aggregates $\texttt{SUM}(C)$ and $\texttt{SUM}(E)$ over $\Delta$ :

[TABLE]

We can share computation across the expressions (57) to (60) since they are similar. For instance, given an assignment $b$ for $B$ , all these aggregates need $V_{D}(b)$ . Similarly, for a given assignment $a$ for $A$ , the aggregates (58) and (60) can share the computation of the sum aggregate over $Q_{E}^{a}$ . For assignments $a\in Q_{A}$ and $b\in Q^{a}_{B}$ , expressions (58) and (59) can share the computation of the sum aggregate over $Q_{C}^{a,b}$ . ∎

To share as much computation as possible between aggregates, AC/DC computes all aggregates together over a single variable order, which significantly improves the data locality of the aggregate computation. This approach does not follow Proposition 4.4 that assumes that each aggregate is computed over its respective best variable order. AC/DC thus decidedly sacrifices the goal of achieving the lowest-known complexity for individual aggregates for the sake of sharing as much computation as possible across these aggregates.

Aggregate Decomposition and Registration.

For a model of degree $degree$ and a set of variables $\{A_{l}\}_{l\in[n]}$ , we have aggregates of the form $\texttt{SUM}(\prod_{l\in[n]}A_{l}^{d_{l}})$ , possibly with a group-by clause, such that $0\leq\sum_{l\in[n]}d_{l}\leq 2\cdot degree$ , $d_{l}\geq 0$ , and all categorical variables are turned into group-by variables. The reason for $2\cdot degree$ is due to the $\vec{\Sigma}$ matrix used to compute the gradient of the loss function (17), which pairs any two features of degree up to $degree$ . Each aggregate is thus defined uniquely by a monomial $\prod_{l\in[n]}A_{l}^{d_{l}}$ ; we may discard the variables with exponent 0. For instance, the monomial for $\texttt{SUM}(C*E)$ is CE while for $\texttt{SUM}(C*E)\text{ GROUP BY }A$ is ACE.

Aggregates can be decomposed into shareable components. Consider a variable order $\Delta=A(\Delta_{1},\ldots,\Delta_{k})$ , with root $A$ and subtrees $\Delta_{1}$ to $\Delta_{k}$ . We can decompose any aggregate $\alpha$ to be computed over $\Delta$ into $k+1$ aggregates such that aggregate [math] is for $A$ and aggregate $j\in[k]$ is for $root(\Delta_{j})$ . Then $\alpha$ is computed as the product of its $k+1$ components. Each of these aggregates is defined by the projection of the monomial of $\alpha$ onto $A$ or $vars(\Delta_{j})$ . The aggregate $j$ is then pushed down the variable order and computed over the subtree $\Delta_{j}$ . If the projection of the monomial is empty, then the aggregate to be pushed down is $\texttt{SUM}(1)$ , which computes the size of the join defined by $\Delta_{j}$ . If several aggregates push the same aggregate to the subtree $\Delta_{j}$ , this is computed only once for all of them.

The decomposed aggregates form a hierarchy whose structure is that of the underlying variable order $\Delta$ . The aggregates at a variable $X$ are denoted by $\mathsf{aggregates}_{X}$ . All aggregates are to be computed at the root of $\Delta$ , then fewer are computed at each of its children and so on. This structure is the same regardless of the input data and can be constructed before data processing. We therefore construct at compile time for each variable $X$ in $\Delta$ an aggregate register $\mathcal{R}_{X}$ that is an array of all aggregates to be computed over the subtree of $\Delta$ rooted at $X$ . This register is used as an index structure to facilitate the computation of the actual aggregates. More precisely, an entry for an aggregate $\alpha$ in the register of $X$ is labeled by the monomial of $\alpha$ and holds an array of indices of the components of $\alpha$ located in the registers at the children of $X$ in $\Delta$ and in the local register $\Lambda_{X}$ of $X$ . Figure 5 depicts this construction.

The hierarchy of registers forms an index structure that is used by AC/DC to compute the aggregates. This index structure is stored as one contiguous array in memory, where the entry for an aggregate $\alpha$ in the register comes with an auxiliary array with the indices of $\alpha$ ’s aggregate components. The aggregates are ordered in the register so that we increase sequential access, and thus cache locality, when updating them.

Example 19.

Let us compute a regression model of degree $1$ over a dataset defined by the join of the relations $R(A,B,C),S(B,D)$ , and $T(A,E)$ . We assume that $B$ and $E$ are categorical features, and all other variables are continuous. The quantities ( $\vec{\Sigma}$ , $\vec{c}$ , $s_{Y}$ ) require the computation of the following aggregates: $\texttt{SUM}(1)$ , $\texttt{SUM}(X)$ for each variable $X$ , and $\texttt{SUM}(X*Y)$ for each pair of variables $X$ and $Y$ .

Figure 6(a) depicts a variable order $\Delta$ for the natural join of three relations, and Figure 6(b) illustrates the aggregate register that assigns a list of aggregates to each variable in $\Delta$ . The aggregates are identified by their respective monomials (the names in the register entries). The categorical variables are shown in bold. Since they are treated as group-by variables, we do not need aggregates whose monomials include categorical variables with exponents higher than 1. Any such aggregate is equivalent to the aggregate whose monomial includes the categorical variable with degree 1 only.

The register $\mathcal{R}_{A}$ for the root $A$ of $\Delta$ has all aggregates needed to compute the model. The register $\mathcal{R}_{B}$ has all aggregates from $\mathcal{R}_{A}$ defined over the variables in the subtree of $\Delta$ rooted at $B$ . The variables $C$ , $D$ , and $E$ are leaf nodes in $\Delta$ , so the monomials for the aggregates in the registers $\mathcal{R}_{C}$ , $\mathcal{R}_{D}$ , and $\mathcal{R}_{E}$ are the respective variables only. We use two additional registers $\Lambda_{A}$ and $\Lambda_{B}$ , which hold the aggregates corresponding to projections of the monomials of the aggregates in $\mathcal{R}_{A}$ , and respectively $\mathcal{R}_{B}$ , onto $A$ , respectively $B$ . For a leaf node $X$ , the registers $\Lambda_{X}$ and $\mathcal{R}_{X}$ are the same.

A path between two register entries in Figure 6(b) indicates that the aggregate in the register above uses the result of the aggregate in the register below. For instance, each aggregate in $\mathcal{R}_{B}$ is computed by the product of one aggregate from $\Lambda_{B}$ , $\mathcal{R}_{C}$ , and $\mathcal{R}_{D}$ . The fan-in of a register entry thus denotes the amount of sharing of its aggregate: All aggregates from registers above with incoming edges to this aggregate share its computation. For instance, the aggregates with monomials AB, AC, and AD from $\mathcal{R}_{A}$ share the computation of the aggregate with monomial A from $\Lambda_{A}$ as well as the count aggregate from $\mathcal{R}_{E}$ . Their computation uses a sequential pass over the register $\mathcal{R}_{B}$ . This improves performance and access locality as $\mathcal{R}_{B}$ can be stored in cache and accessed to compute all these aggregates. ∎

Aggregate Computation.

Once the aggregate registers are in place, we can ingest the input database and compute the aggregates over the join of the database relations following the factorized structure given by a variable order. The algorithm in Figure 4 does precisely this. Section 6.1 explained the factorized computation of a single aggregate over the join. We explain here the case of several aggregates organized into the aggregate registers. This is stated by the pseudocode in the red boxes.

Each aggregate is uniformly stored as a map from tuples over their categorical variables to payloads that represent the sums over the projection of its monomial on all continuous variables. If the aggregate has no categorical variables, the key is the empty tuple.

For each possible $A$ -value $a$ , we first compute the array $\lambda_{A}$ that consists of the projections of the monomials of the aggregates onto $A$ . If $A$ is categorical, then we only need to compute the 0 and 1 powers of $a$ . If $A$ is continuous, we need to compute all powers of $A$ from 0 to $2\cdot degree$ . If $A$ is not a feature used in the model, then we only compute a trivial count aggregate.

We update the value of each aggregate $\alpha$ using the index structure depicted in Figure 5 as we traverse the variable order bottom up. Assume we are at a variable $A$ in the variable order. In case $A$ is a leaf, the update is only a specific value in the local register $\lambda_{A}$ . In case the variable $A$ has children in the variable order, the aggregate is updated with the Cartesian product of all its component aggregates, i.e., one value from $\lambda_{A}$ and one aggregate for each child of $A$ . The update value can be expressed in SQL as follows. Assume the aggregate $\alpha$ has group-by variables $C$ , which are partitioned across $A$ and its $k$ children. Assume also that $\alpha$ ’s components are $\alpha_{0}$ and $(\alpha_{j})_{j\in[k]}$ . Recall that all aggregates are maps, which we may represent as relations with columns for keys and one column $P$ for payload. Then, the update to $\alpha$ is:

[TABLE]

Further Considerations.

The auxiliary arrays that provide the precomputed indices of aggregate components within registers speed up the computation of the aggregates. Nevertheless, they still represent one extra level of indirection since each update to an aggregate would first need to fetch the indices and then use them to access the aggregate components in registers that may not be necessarily in the cache. We have been experimenting with an aggressive aggregate compilation approach that resolves all these indices at compile time and generates the specific code for each aggregate update. In experiments with linear regression, this compilation leads to a 4 $\times$ performance improvements. However, the downside is that the AC/DC code gets much larger and the C++ compiler needs much more time to compile it. For higher-degree models, it can get into situations where the C++ compiler crashes. We are currently working on a hybrid approach that partially resolves the indices while maintaining a reasonable code size.

Point Evaluation, Gradient Computation and FD Optimization

For the computation of the point evaluation and gradient computation we use the optimizations we introduced in Section 4.4. Recall that two entries in $\vec{\Sigma}$ can have the identical representation, which implies that a single aggregate can be used in distinct products over different components of $g$ . In order to avoid keep track of which aggregates correspond to which entries in $\Sigma$ , we construct for each aggregate a list of index pairs $(i,j)$ for each $\vec{\sigma}_{ij}\in\vec{\Sigma}$ that require this aggregate. AC/DC then uses the index list and the aggregate computed at the root of the variable order to compute the queries for point evaluation and gradient computation that were presented in Section 4.4. Consider the matrix vector product $\mathbf{p}=\vec{\Sigma}g(\vec{\theta})$ , which is needed for gradient computation. Let $A$ be the root of the variable order $\Delta$ . We compute $\mathbf{p}$ by iterating over all aggregate maps $\alpha\in\mathsf{aggregates}_{A}$ , and for each index pair $(i,j)$ that is assigned to $\alpha$ , we add to the $i$ ’th component of $\mathbf{p}$ the product of $\alpha$ and $g_{j}(\vec{\theta})$ . If $i\neq j$ , we also add to $j$ ’s component of $\mathbf{p}$ with the product of $\alpha$ and $g_{i}(\vec{\theta})$ .

For the FD optimization, it is required to construct the $\mathbf{R}_{c}$ matrices that were introduced in Section 5.2. In AC/DC, we represent these matrices as maps that group the values for functionally determining variables by the values that they determine. The maps are sparse representations of $\mathbf{R}_{c}$ matrices, and they are populated during the computation of the factorized aggregates over the variable order. We choose this representation because it allows for the efficient computation of the matrix $I+\mathbf{R}_{c}^{\top}\mathbf{R}_{c}$ , which is the basic building block the matrix $\mathbf{B}_{T,q}$ from (47). The reparameterization of the regularizer requires the computation of the inverse of $\mathbf{B}_{T,q}$ . Therefore, we store the matrix $\mathbf{B}_{T,q}$ as a sparse matrix in the format used by the Eigen linear algebra library [34], and then use Eigen’s Sparse Cholesky Decomposition to compute the inverse of $\mathbf{B}_{T,q}$ .

7 Experiments

We report on the performance of learning regression and factorization machine models over three real datasets used in retail and advertisement applications. We benchmark AC/DC against state-of-the-art competitors. AC/DC can compute the models up to $1,031\times$ faster, while the accuracy of AC/DC’s models is always at least as good as the competitor’s models. For all experiments, we assume that the model specification is given as input, and all systems compute the same model. We do not consider the orthogonal problem of finding the best model for a given analytics task.

7.1 Experimental setup

All experiments were performed on an Intel(R) Core(TM) i7-4770 3.40GHz/64bit/32GB with Ubuntu 18.04, g++7.4, and eight cores. We report wall-clock times by running each system once and then reporting the average of four subsequent runs with warm cache. We do not report the times to load the database into memory for the join. All relations are sorted by their join variables.

Competitors

We benchmark AC/DC against six competitors: MADlib [38] 1.16, libFM [64] 1.4.2, TensorFlow [1] 1.13.1, R [63] 3.4.4, scikit-learn [58] 0.20, and Python Statsmodels [68]. We evaluate the performance of learning the models over a training dataset, and then compute the root-mean-squared-error (RMSE) over a separate test dataset to compare the accuracy.

MADlib, R, scikit-learn, and Python StatsModels use ols (ordinary least squares) to compute the closed-form solution of regression models, and TensorFlow uses the LinearRegressor estimator with ftrl optimization [48], which is based on the conventional SGD optimization algorithm. TensorFlow was compiled from source to enable specialized optimizations that are native to our machine, including AVX optimizations. We use PostgreSQL 10.9 to compute the feature extraction query for libFM and TensorFlow. LibFM supports factorization machines.

AC/DC uses the gradient descent algorithm with the adaptive learning rate from Algorithm 1. For LR and PR models, we run the optimization algorithm until the RMSE over the training dataset changes by less than $10^{-15}$ in three consecutive iterations. For FaMa models, we use 300 iterations.

The aggregate computation in AC/DC is parallelized on eight cores. Tensorflow also uses multiple threads by default. MADlib and libFM only use a single thread.

The competitors come with various limitations that affect their scalability. MADlib requires the explicit one-hot encoding of the input relations, for which we use its predefined functions.

LibFM requires as input a zero-suppressed encoding of the one-hot encoded join result. Computing this representation is an expensive intermediate step between exporting the query result from the database system and importing the data into libFM. We learn the FaMa models using the MCMC optimization algorithm with a fixed number of runs (300); its SGD implementation requires a fixed learning rate and does not converge.

TensorFlow uses a user-defined iterator interface to load a batch of tuples from the training dataset at a time. This iterator defines a mapping from input tuples to features and is called directly by the learning algorithm. To avoid the explicit one-hot encoding of the features, TensorFlow encodes categorical features using a hash function. Learning over batches requires a random shuffling of the input data, which in TensorFlow amounts to loading the entire dataset into memory. This failed for our experiments due to the large sizes of the datasets. We therefore shuffle the data in PostgreSQL instead and provide the shuffled input to TensorFlow. We benchmark TensorFlow for LR only as it does not provide functionality to create all pairwise interaction terms for PR and FaMa. The optimal batch size for our experiments is 100,000 tuples. Smaller batch sizes require loading too many batches, very large batches cannot fit into memory. Since TensorFlow uses a fixed number of iterations, we report the times to optimize with one epoch over the training dataset. This means that the algorithm learns over each data point in the training dataset once. Our experiments show that it is often necessary to optimize with several epochs to learn a good model.

R, scikit-learn, and Python Statsmodels fail to compute our models due to design limitations. R limits the number of values in their data frames to $2^{31}-1$ , which is insufficient to represent out datasets. Scikit-learn and Python Statsmodels both run out of memory for all considered models.

Datasets

We experimented with three real-world datasets: (1) Retailer [66] is used by a large retailer for forecasting user demand and sales; (2) Favorita [26] is a public dataset used for retail forecasting; and (3) Yelp is based on the Yelp Dataset Challenge [70] and used to predict ratings by a user for a business. The structure and size of these datasets is common in retail and advertising, where data is easily generated by sales transactions or click streams. The feature extraction query for each dataset is the natural join of the input relations. For each dataset, we consider a subset of the variables. Table 1 presents key characteristics for each dataset. It shows that the join result can be orders of magnitude larger than the input database.

Retailer has a star schema with one fact table Inventory, which keeps track of the number of inventory units for products (sku) in a store (locn) at a given date, and three dimension tables: (1) Location keeps additional information for each store, including its size in sqft and distances to three competitors; (2) Items provides identifiers for the item category, subcategory, category cluster, as well as the price for each sku; and (3) Weather keeps information about the weather conditions for each store at a given date (including maximum temperature, and whether it rained). We design two versions of our dataset. The version v1 includes all variables but sku, date, and locn as features, and has no functional dependencies. Version v2 extends v1 with the categorical variable sku, and exploits the functional dependency sku $\to$ {category, subcategory, categoryCluster}. We learned LR, $\text{\sf PR}_{2}$ , and $\text{\sf FaMa}^{8}_{2}$ models that predict the amount of inventory units based on all other features. The test data constitutes the inventory in the last month in the dataset (approx. 2.2% of the data). This simulates the realistic usecase where the ML model predicts future inventory.

Favorita has a star schema with one fact table and 4 dimension tables. The fact table Sales stores the number of units sold for items for a given date and store, and an indicator whether or not the unit was on promotion at this time. Items keeps additional information about the skus, including the item class and whether it is perishable. Stores provides additional information about the stores, such as the city and state they are located in, and the type of store. Transactions gives the number of transactions at each store on a given date. Oil keeps the oil price for each date. Our models predict the number of units sold. We designed two variants for this dataset. Version v1 learns the model over all variables except sku and date, and version v2 extends v1 with sku. We exploit the functional dependency store $\to$ {city, state, storetype} in both variants. The test data constitutes the sales for the last month in the dataset (approx. 1.2% of the data).

Yelp has a star schema with four relations: Review stores, for each review, the rating given by user, the review date, and the number of compliments it received (e.g, useful, funny, cool); Business keeps the location (city, state, coordinates), the average rating, and the total count of reviews for each business; User keeps aggregated statistics for each user, including the number of reviews they wrote, the number of compliments they received, and the average rating they gave to businesses; Attribute keeps attributes (e.g. as “open late”) for each business. One user can review many businesses and a business can have many attributes. The result of the feature extraction query is much larger than the input relations. Our models predict the rating that users give to businesses. The models are learned over all variables except the join keys, and exploit the functional dependency city $\to$ state. The test dataset is a random selection of approximately 2% of the reviews.

7.2 Summary of findings

Our findings on the performance comparison between AC/DC and the three competitors are given in Table 2. AC/DC is the fastest system in our experiments. It can compute the models over the input database orders of magnitude faster than its competitors whenever they do not exceed memory limitation, 24-hour timeout, or internal design limitations.

AC/DC learns models with at least as good accuracy as the competitors. In particular, AC/DC’s LR models have comparable accuracy to MADlib’s closed form solution (whenever MADlib does not time out), and consistently better accuracy than the models learned by TensorFlow (trained for one epoch). With additional features, AC/DC can learn more accurate models while the competitors either timeout or fail due to internal design limitations.

The performance gap between competitors and AC/DC is primarily due to the following optimizations supported by AC/DC:

It avoids the materialization of the join and the export-import step between database systems and statistical packages, which may take longer than the end-to-end learning of LR models in AC/DC. AC/DC performs the join together with the aggregates using one execution plan; 2. 2.

It factorizes the computation of the sparse tensor and the underlying join. The compression factor brought by join factorization is 13.9 $\times$ for Retailer, $4.7\times$ for Favorita, and 21 $\times$ for Yelp; 3. 3.

It massively shares the computation of many aggregates representing entries in the sparse tensor. For instance, there are up to 2.5M such sum aggregates for $\text{\sf PR}_{2}$ on Retailer v2 and they take 150K $\times$ less time than computing the count aggregate 2.5M times, where the count takes 19.9 seconds as reported in Table 2; 4. 4.

It decouples the computation of the aggregates on the input data from the parameter convergence step and thus avoids scanning the join result for each iteration; 5. 5.

It avoids the upfront one-hot encoding that comes with higher asymptotic complexity and prohibitively large covariance matrices by only computing distinct, non-zero entries in the sparse tensor. For $\text{\sf PR}_{2}$ on Retailer v2, this leads to $868\times$ less aggregates to compute; 6. 6.

It exploits the functional dependencies in the input data to reduce the number of features of the model, which leads to an improvement factor of up to 2.3 $\times$ .

7.3 Further details

Categorical features

As we move from Retailer v1 to v2, we increase the number of categorical features by approx. $75\times$ for LR (from 49 to 3.7K) and $67\times$ for $\text{\sf PR}_{2}$ and $\text{\sf FaMa}^{8}_{2}$ (from 980 to 66K). This translates to a same-order increase in the number of aggregates: $62\times$ ( $56\times$ ) more distinct non-zero aggregates in v2 vs v1 for LR (resp. $\text{\sf PR}_{2}$ and $\text{\sf FaMa}^{8}_{2}$ ). This increase only led to a decrease in performance of AC/DC of $10\times$ for LR and $6.6\times$ for $\text{\sf PR}_{2}$ . This sub-linear behavior is partly explained by the ability of AC/DC to process many aggregates much faster in bulk than individually: it takes 19.9 seconds for one count aggregate, but only 334 seconds to compute all 2.5M entries in the sparse tensors for $\text{\sf PR}_{2}$ on v2!

For MADlib, the performance decrease is at least $78\times$ for LR when moving from v1 to v2 and it times out after 24 hours for v2. MADlib also times out for all $\text{\sf PR}_{2}$ experiments. The performance of TensorFlow is largely invariant to the increase in the number of categorical features, since its internal mapping from tuples in the training dataset to the sparse representation of the features vector remains of similar size. Nevertheless, our system is consistently faster (often by orders of magnitude) than computing only a single epoch in TensorFlow. In addition, AC/DC computes the end-to-end models faster than PSQL takes to materialize and shuffle the design matrix that is the input to TensorFlow.

One-hot encoding vs.sparse encoding with group-by aggregates.

One-hot encoding categorical features leads to a large number of zero and/or redundant entries in the $\vec{\Sigma}$ matrix. For instance, for $\text{\sf PR}_{2}$ on Retailer v2, the number of features is $m$ =66,117, and then the upper half of $\vec{\Sigma}$ would have $m(m+1)/2\approx 2.1\times 10^{9}$ entries. Most of these are either zero or repeating, as exemplified in Section 4.3. In contrast, AC/DC’s sparse representation only considers 2.5M non-zero and distinct aggregates. The number of aggregates is thus reduced by a factor of $868$ !

MADlib and libFM require the data be one-hot encoded before learning. For MADlib, the static one-hot encoding took up to 545 seconds on Favorita. In addition to the one-hot encoding, libFM requires the input data to be represented in a zero-supressed sparse format. This data transformation took up to one hour in our experiments. TensorFlow one-hot encodes on the fly using hash functions during the learning phase, which cannot be reported separately.

When running over one-hot encoded input data, AC/DC exceeds the available memory for all models but the linear regression model for Retailer v1.

Effect of functional dependencies.

Figure 7 shows the performance breakdown for AC/DC with and without exploiting FDs. All other systems do not exploit FDs.

The FDs have a twofold effect on AC/DC: they can reduce the number of features and aggregates, which leads to better performance of the aggregation step; yet it requires a more expensive convergence step due to the more complex regularizer. For LR over Retailer v2, the aggregate step becomes $2\times$ faster, while the convergence step increases $15\times$ , offsetting the effect of the faster aggregate computation. The FDs for Favorita and Yelp have a relatively smaller effect on performance for LR (there are only 54 stores in Favorita, so the reduction in the number of aggregates is small).

For $\text{\sf PR}_{2}$ models, the FD brings an improvement by a factor of $2.3\times$ for Retailer. This is due to a 18% decrease in the number of categorical features, which leads to a 38% decrease in the number of group-by aggregates. For Favorita and Yelp, the performance improvement is 1.3 $\times$ and respectively 1.2 $\times$ . For $\text{\sf FaMa}^{8}_{2}$ models, the effect of FDs is comparable to that of $\text{\sf PR}_{2}$ models.

Accuracy.

The RMSE of the LR models for Retailer $\mathsf{v}_{1}$ and Favorita $\mathsf{v}_{1}$ in AC/DC is within 1% of that for the closed form solutions computed in MADlib. By extending the models with the categorical variable sku (version $\mathsf{v}_{2}$ for both datasets), the RMSE decreases by 21% for Retailer and by 6% for Favorita. MADlib fails to learn these more accurate models, because it times out after 24 hours.

For TensorFlow, the models are trained with a single epoch. The resulting models have a consistently higher RMSE than the corresponding model computed in AC/DC. In particular, for Retailer $\mathsf{v}_{2}$ , the RMSE of the TensorFlow model is 31% higher. TensorFlow requires more epochs to achieve the same accuracy as AC/DC.

Extending the model with pairwise interactions can also improve the model accuracy. For Favorita $\mathsf{v}_{2}$ , for instance, the RMSE of the $\text{\sf PR}_{2}$ model is $3\%$ lower than the RMSE of the LR model.

The FaMa models learned by AC/DC do not reach the same accuracy as the $\text{\sf PR}_{2}$ models, and would therefore require more than 300 iterations to converge to an accurate model. This is due to the non-linear structure of factorization machines.

8 Related work

Our work follows closely Chaudhuri’s manifesto on SQL-aware data mining systems from two decades ago [19] in two key aspects. First, the goal of our work is not to invent new machine learning models or data analysis techniques, but identify common data-centric steps across a broad class of learning algorithms and investigate their theoretical and systems challenges. We show that such steps can be encoded as SQL group-by aggregate queries, which are amenable to shared batch computation. Second, our approach performs data analysis not only over materialized relations but more importantly over feature extraction queries, whose results need not be materialized. This enables the interaction between the aggregates encoding the data-centric steps and the underlying queries (this is called ad-hoc mining in Chaudhuri’s terminology).

A reevaluation of Chaudhuri’s manifesto in today’s context brings forth two important technical changes. The first game-changer is represented by the recent development on query processing. This includes a new breed of worst-case optimal join algorithms, which support listing representation [52, 69] and factorized representation [56] of query results, and extensions to aggregate computation [13, 8, 55]. These algorithms exploit developments on (fractional) hypertree decompositions of relational queries [32, 33, 47]. These algorithms overshadow the traditional query plans in both asymptotic complexity [53] and practical performance [13, 54]. The second change is in the workload. Whereas SQL-aware data mining systems were mostly concerned with association rules, decision trees, and clustering, current workloads feature a broader spectrum of increasingly more sophisticated machine learning (ML) models, including polynomial regression models, factorization machines, principal component analysis, generalized linear models, generalized low-rank models, sum-product networks, and convolutional networks. In this article, we introduce a unified approach to learning polynomial regression models, factorization machines, and principal component analysis over non-materialized feature extraction queries with the lowest known complexity and best performance to date. There is also a more profound orthogonal change: There is more data readily available in all aspects of our society and there is more appetite in industry to monetize it by turning it into knowledge.

The current landscape for analytics solutions over multi-relational data can be categorized depending on the degree of integration of the data system, which hosts the data and supports data access via query primitives, with the ML library of models and learning algorithms.

By far the most common solutions provide no integration of the two systems, which are distinct tools on the technology stack: The data system exports the training dataset as one relation, commonly presented as a CSV file, and then the ML system imports it into its own format and learns the desired model.Such solutions are structure-agnostic as they disregard the rich structural information of the materialized training dataset, including the input database schema, database dependencies, and the structure of the feature extraction queries. Prime examples are the pairing of open-source data systems such as Spark [71] or MySQL/PostgreSQL with ML systems such as R [63], Python StatsModels [68], Python Scikit [58], MLpack [22], TensorFlow [1], SystemML [39, 15], MLLib [49], and DeepDist [51]. The advantage of this approach is that the two systems can be developed independently, with virtually any ML model readily available for use.

Two disadvantages of such common solutions are the expensive data export/import at the interface between the two systems and the materialization of the training dataset as a result of a feature extraction query over multi-relational data. The feature extraction query is computed inside the data system, its result exported and imported into the data format of the ML system, where the model is learned. Furthermore, the materialized training dataset may be much larger than the input data (cf.Table 1). This is exacerbated by the stark asymmetry between the two systems: Whereas data systems tend to scale to large datasets, this is not the case for ML libraries. Yet, such solutions expect by design that the ML libraries work on even larger inputs than the data systems! A further disadvantage is that these solutions inherit the limitations of both underlying systems. For instance, the R data frame can host at most $2^{31}$ values, which makes it impossible to learn models over large datasets, even if data systems can process them. Database systems can only handle up to a few thousand columns per relation, which is usually smaller than the number of features of the model.

The second class of systems features a loose integration, even though they remain structure-agnostic: The ML code migrates inside the space of the data system process, with each ML task being implemented by a distinct user-defined aggregate function (UDAF). Prime examples of this class are MADlib [38] and GLADE PF-OLA [62]. MADlib casts analytics as UDAFs that can be used in SQL queries and executed inside PostgreSQL. GLADE PF-OLA casts analytics as a special form of UDAFs called Generalized Linear Aggregates that can be executed using the GLADE distributed engine [21]. These UDAFs remain black boxes for the underlying query engine, which has to compute the feature extraction query and delegate the UDAF computation on top of the query result to the MADLib’s and GLADE PF-OLA’s specialized code. The advantage of this approach is that the expensive export/import step is avoided. The disadvantage is that each ML task has to be migrated inside the data system space, which comes with design and implementation overhead. A further step towards integration is exemplified by Bismarck [27], which provides a unified programming architecture for many ML tasks instead of one UDAF per task, with possible code reuse across UDAFs.

The third class of systems features a tight integration and are structure-aware: There is one execution strategy for both the feature extraction query and the subsequent learning task, with components of the latter possibly pushed past the joins in the former. Prime examples are Morpheus [43], Hamlet [44], and our prior system F [66] that support generalized linear models, Naïve Bayes classification, and respectively linear regression models with continuous features. This class also contains the recent efforts on in-database linear algebra [20] and on scaling linear algebra using existing distributed database systems [46] and the declarative language BUDS [29], whose compiler can perform deep optimizations of the user’s program. Our approach AC/DC generalizes F to non-linear models, categorical features, and model reparameterization under functional dependencies. A key aspect that sets apart AC/DC and its predecessor F from prior work is that they employ execution strategies for the mixed workload of queries and learning with complexity that may be asymptotically lower than that of query materialization alone. In particular, all machine learning approaches that require as input the materialization of the result of the feature extraction query are asymptotically suboptimal. This complexity gap translates into a performance gap, cf.Section 7.

Figure 1 sums up the difference between the first two classes that fall under structure-agnostic learning and the third class that broadly represents structure-aware learning. The inspiration for our work lies with factorized computation of aggregates over joins [13, 8], which avoids the materialization of joins, and with the LogicBlox system [50, 11], which has a unified system architecture and declarative programming language for hybrid database and optimization workloads.

Beyond the above classification, there are further directions of research looking at ML through database glasses: ML-aware query languages, the effect of dependencies on model training, sparse data representations, and implementations of gradient descent solvers.

Analytical tasks can be expressed to a varying degree within query languages possibly extended with new constructs. Very recent works investigate query languages for matrices [18] and a relational framework for classifier engineering [41]. They follow works on query languages with data mining capabilities [17, 57], also called descriptive or backward-looking analytics, and on in-database data mining solutions, such as frequent itemsets [59] and association rule mining [10]. Our rewriting of ML code into aggregates falls into this line of work as well. The additional fixpoint computation needed on top of the aggregate computation for convergence of the model parameters, which is intrinsic to gradient descent approaches, can be expressed as recursive queries [3].

Functional dependencies (FDs) can be used to avoid key-foreign key joins and reduce the number of features in Naïve Bayes classification and feature selection [44]. In this article we consider the effect of FDs on the reparameterization of regression models, where a non-trivial development is on the effect of FDs on the model’s non-linear regularization function, cf.Section 5. Our factorized learning approach exploits the join dependencies present in the training dataset, as defined by the feature extraction query. This follows prior work on factorized databases [13, 56].

State-of-the-art machine learning systems use a sparse representation of the input data to avoid redundancy introduced by one-hot encoding [64, 25]. In our setting, however, such systems require an additional data transformation step after the result of the feature extraction query is exported. This additional step is time consuming and makes the use of such systems inefficient in many practical applications. In statistics and machine learning, there is a rich literature on learning with sparse and/or multilinear structures [37]. Such methods complement our framework and it would be of interest to leverage and adapt them to our setting.

Finally, there is a large collection of gradient-based methods proposed in the optimization literature. The description of our approach assumes batch gradient descent (BGD), though our insights are applicable to other methods including Quasi-Newton algorithms. The main rationale for our choice is simplicity and good statistical properties. When combined with backtracking line search (as we do in this article) or second-order gradient estimation (as in Quasi-Newton methods), BGD is guaranteed to converge to a minimum with linear asymptotic convergence rate. A naïve computation of the gradient requires a full pass over the data, which can be inefficient in large-scale analytics. A popular alternative is stochastic gradient descent (SGD), which estimates the gradient with a randomly selected mini-batch of training samples. The convergence of SGD, however, is noisy, requires careful setting of hyperparameters, and does not achieve the linear asymptotic convergence rate of BGD [16]. In our setting, the entire BGD execution can be arbitrarily faster than one SGD iteration over the result of the feature extraction query. The reason is orthogonal to properties of the two gradient descent methods: The complexity of computing the sufficient statistics needed for convergence of model parameters can be asymptotically lower than the complexity of computing the training dataset.

9 Open Problems

Our in-database learning framework raises open questions on statistics, algorithm design, and optimization. We next sketch a few representative questions.

One research direction is to further extend the class of statistical models that can be trained efficiently by exploiting the structure of the underlying relational database. Our formulation (6) captures a common class of regression models (such as PR and FaMa), classification models (such as logistic and SVM), and unsupervised learning techniques (such as principal component analysis) which is done by changing the loss function $\mathcal{L}$ . It remains open how to extend our formulation to capture latent variable models.

The aggregates defining $\vec{\Sigma}$ , $\mathbf{c}$ , point evaluation, and gradient computation are “multi-output” queries. They deserve a systematic investigation, from formulation to evaluation and complexity analysis. In practice, one often reserves a fragment of the training data for model validation. It is an interesting question to incorporate this data partitioning requirement into our framework.

Understanding how to adapt further optimization algorithms, such as coordinate descent or stochastic gradient, to our structure-aware framework is an important research direction. Furthermore, our FD-aware optimization is specific to the $\ell_{2}$ -norm in the penalty term. We would also like to understand the effect of other norms, e.g., $\ell_{1}$ , on model reparameterization under FDs.

Finally, we conjecture that the cost function may be easier to optimize with respect to the reduced set of parameters that are not functionally determined: As redundant variables are eliminated or optimized out, the cost function’s Hessian with respect to reduced parameters becomes less ill-conditioned, resulting in faster convergence behavior for gradient-based optimization techniques. The impact of FD-based dimensionality reduction, from both computational and statistical standpoints, have not been extensively studied for learning (nonlinear) models with categorical variables, which are precisely the kind discussed in our framework.

Acknowledgements

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 682588. XN is supported in part by grants NSF CAREER DMS-1351362, NSF CNS-1409303 and the Margaret and Herman Sokol Faculty Award.

Appendix A Matrix Calculus

We introduce matrix inversion formulas and identities regarding tensors and the various products introduced in Section 2.

We use the following matrix inversion formulas [35].

Proposition A.1.

We have

[TABLE]

whenever all dimensions match up and inverses on the right hand side exist. In particular, the following holds when $\mathbf{C}=(1)$ , $\mathbf{U}=\mathbf{1}$ , $\mathbf{V}=\mathbf{1}^{\top}$ , and $\mathbf{J}$ is the all- $1$ matrix:

[TABLE]

Another special case is

[TABLE]

An even more special case is the Sherman-Morrison formula, where $\mathbf{U}^{\top}$ is just a vector $\mathbf{u}$ . The matrix $\mathbf{A}+\mathbf{u}\mathbf{u}^{\top}$ is typically called a rank- $1$ update of $\mathbf{A}$ :

[TABLE]

Next, we discuss some identities involving tensor, Khatri-Rao, and Hadamard products.

Proposition A.2.

We have (if the dimensionalities match up correctly):

[TABLE]

If $\mathbf{x}$ is a standard $n$ -dimensional unit vector, $\mathbf{A}$ and $\mathbf{B}$ are two matrices with $n$ columns each, and $\mathbf{a}$ and $\mathbf{b}$ are two $n$ -dimensional vectors, then

[TABLE]

Let $\mathbf{x}$ be a standard $n$ -dimensional unit vector, $\mathbf{A}_{1},\dots,\mathbf{A}_{k}$ be $k$ matrices with $n$ columns each. Then,

[TABLE]

We note in passing that the first five identities are very useful in our dimension reduction techniques by exploiting functional dependencies, while (70), (71), and (72) are instrumental in achieving computational reduction in our handling of categorical features.

Proof.

The identities (65), (66), (67), and (68) can be found in the Matrix Cookbook [60]. Identity (69) follows from (65) and (66). To see (70), note that

[TABLE]

where the last equality follows due to the following reasoning. Suppose $x_{j}=1$ for some $j$ , then $\mathbf{A}\mathbf{x}=\mathbf{a}_{j}$ and $\mathbf{B}\mathbf{x}=\mathbf{b}_{j}$ , where $\mathbf{a}_{j}$ and $\mathbf{b}_{j}$ are the $j$ th columns of $\mathbf{A}$ and $\mathbf{B}$ , respectively. Thus,

[TABLE]

Identities (71) and (72) are proved similarly, where (72) is a trivial generalization of (70). ∎

Appendix B Tensor computation and FAQ queries

Quite often we need to compute a product of the form $(\mathbf{A}\otimes\mathbf{B})\mathbf{C}$ , where $\mathbf{A},\mathbf{B}$ , and $\mathbf{C}$ are tensors, provided that their dimensionalities match up. For example, suppose $\mathbf{A}$ is an $m\times n$ matrix, $\mathbf{B}$ a $p\times q$ matrix, and $\mathbf{C}$ a $nq\times 1$ matrix (i.e. a vector). The result is a $mp\times 1$ tensor. The brute-force way of computing $(\mathbf{A}\otimes\mathbf{B})\mathbf{C}$ is to compute $\mathbf{A}\otimes\mathbf{B}$ first, taking $\Theta(mnpq)$ -time, and then multiply the result with $\mathbf{C}$ , for an overall runtime of $\Theta(mnpq)$ . The brute-force algorithm is a horribly inefficient algorithm.

The better way to compute $(\mathbf{A}\otimes\mathbf{B})\mathbf{C}$ is to view this as an FAQ-expression [8] (a sum-product form): we think of $\mathbf{A}$ as a function $\psi_{A}(x,y)$ , $\mathbf{B}$ as a function $\psi_{B}(z,t)$ , and $\mathbf{C}$ as a function $\psi_{C}(y,t)$ . What we want to compute is the function

[TABLE]

This is a $4$ -cycle FAQ query:

We can pick between the following two evaluation strategies:

•

Eliminate $t$ first, i.e., compute $\varphi_{1}(y,z):=\sum_{t}\psi_{B}(z,t)\psi_{C}(y,t)$ in time $O(npq)$ ; then, eliminate $y$ , i.e., compute $\varphi(x,y)=\sum_{y}\varphi_{1}(y,z)\psi_{A}(x,y)$ in time $O(mnp)$ . The overall runtime is thus $O(np(m+q))$ .

•

Eliminate $y$ first and then $t$ . The overall runtime is $O(mq(n+p))$ .

This is not surprising, since the problem is just matrix chain multiplication. In the language of FAQ evaluation, we want to pick the best tree decomposition and then compute a variable elimination order out of it [8]. We shall see later that a special case of the above that occurs often is when $\mathbf{B}=\mathbf{I}$ , the identity matrix. In that case, $\psi_{B}(z,t)$ is the same as the atom $z=t$ , and thus it serves as a change of variables:

[TABLE]

In other words, we only have to marginalize out one variable instead of two. This situation arises, for example, in (50) and (51).

Appendix C overviews the InsideOut algorithm for FAQ queries and its complexity analysis.

Appendix C Widths for FAQ Queries and the InsideOut Algorithm

Background: Fractional edge cover number and output size bounds

In what follows, we consider a conjunctive query $Q$ over a relational database instance $I$ . We use $N$ to denote the size of the largest input relation in $Q$ . We also use $Q(I)$ to denote the output and $|Q(I)|$ to denote its size. We use the query $Q$ and its hypergraph $\mathcal{H}$ interchangeably.

Definition 6 (Fractional edge cover number $\rho^{*}$ ).

Let $\mathcal{H}=(\mathcal{V},\mathcal{E})$ be a hypergraph (of some query $Q$ ). Let $B\subseteq\mathcal{V}$ be any subset of vertices. A fractional edge cover of $B$ using edges in $\mathcal{H}$ is a feasible solution $\vec{\lambda}=(\lambda_{S})_{S\in\mathcal{E}}$ to the following linear program:

[TABLE]

The optimal objective value of the above linear program is called the fractional edge cover number of $B$ in $\mathcal{H}$ and is denoted by $\rho^{*}_{\mathcal{H}}(B)$ . When $\mathcal{H}$ is clear from the context, we drop the subscript $\mathcal{H}$ and use $\rho^{*}(B)$ .

Given a conjunctive query $Q$ , the fractional edge cover number of $Q$ is $\rho^{*}_{\mathcal{H}}(\mathcal{V})$ where $\mathcal{H}=(\mathcal{V},\mathcal{E})$ is the hypergraph of $Q$ .

Theorem C.1 (AGM-bound [12, 33]).

Given a full conjunctive query $Q$ over a relational database instance $I$ , the output size is bounded by

[TABLE]

where $\rho^{*}$ is the fractional edge cover number of $Q$ .

Theorem C.2 (AGM-bound is tight [12, 33]).

Given a full conjunctive query $Q$ and a non-negative number $N$ , there exists a database instance $I$ whose relation sizes are upper-bounded by $N$ and satisfies

[TABLE]

Worst-case optimal join algorithms [69, 52, 53, 6] can be used to answer any full conjunctive query $Q$ in time

[TABLE]

Background: Tree decompositions, acyclicity, and width parameters

Definition 7 (Tree decomposition).

Let $\mathcal{H}=(\mathcal{V},\mathcal{E})$ be a hypergraph. A tree decomposition of $\mathcal{H}$ is a pair $(T,\chi)$ where $T=(V(T),E(T))$ is a tree and $\chi:V(T)\to 2^{\mathcal{V}}$ assigns to each node of the tree $T$ a subset of vertices of $\mathcal{H}$ . The sets $\chi(t)$ , $t\in V(T)$ , are called the bags of the tree decomposition. There are two properties the bags must satisfy

(a)

For any hyperedge $F\in\mathcal{E}$ , there is a bag $\chi(t)$ , $t\in V(T)$ , such that $F\subseteq\chi(t)$ .

(b)

For any vertex $v\in\mathcal{V}$ , the set $\{t\ |\ t\in V(T),v\in\chi(t)\}$ is not empty and forms a connected subtree of $T$ .

Definition 8 (acyclicity).

A hypergraph $\mathcal{H}=(\mathcal{V},\mathcal{E})$ is acyclic iff there exists a tree decomposition $(T,\chi)$ in which every bag $\chi(t)$ is a hyperedge of $\mathcal{H}$ .

When $\mathcal{H}$ represents a join query, the tree $T$ in the above definition is also called the join tree of the query. A query is acyclic if and only if its hypergraph is acyclic.

For non-acyclic queries, we often need a measure of how “close” a query is to being acyclic. To that end, we use width notions of a query.

Definition 9 ( $g$ -width of a hypergraph: a generic width notion [9]).

Let $\mathcal{H}=(\mathcal{V},\mathcal{E})$ be a hypergraph, and $g:2^{\mathcal{V}}\to\mathbb{R}^{+}$ be a function that assigns a non-negative real number to each subset of $\mathcal{V}$ . The $g$ -width of a tree decomposition $(T,\chi)$ of $\mathcal{H}$ is $\max_{t\in V(T)}g(\chi(t))$ . The $g$ -width of $\mathcal{H}$ is the minimum $g$ -width over all tree decompositions of $\mathcal{H}$ . (Note that the $g$ -width of a hypergraph is a Minimax function.)

Definition 10 (Treewidth and fractional hypertree width are special cases of $g$ -width).

Let $s$ be the following function: $s(B)=|B|-1$ , $\forall V\subseteq\mathcal{V}$ . Then the treewidth of a hypergraph $\mathcal{H}$ , denoted by $\text{\sf tw}(\mathcal{H})$ , is exactly its $s$ -width, and the fractional hypertree width of a hypergraph $\mathcal{H}$ , denoted by $\text{\sf fhtw}(\mathcal{H})$ , is the $\rho^{*}$ -width of $\mathcal{H}$ .

From the above definitions, $\text{\sf fhtw}(\mathcal{H})\geq 1$ for any hypergraph $\mathcal{H}$ . Moreover, $\text{\sf fhtw}(\mathcal{H})=1$ if and only if $\mathcal{H}$ is acyclic.

Background: Vertex/variable orderings and their equivalence to tree decompositions

Besides tree decompositions, there is another way to define acyclicity and width notions of a hypergraph, which is orderings of the hypergraph vertices. And just like we refer to queries and hypergraphs interchangeably, we also refer to query variables and hypergraph vertices interchangeably.

In what follows, we use $n$ to denote the number of vertices of the given hypergraph $\mathcal{H}$ .

Definition 11 (Vertex ordering of a hypergraph).

A vertex ordering of a hypergraph $\mathcal{H}=(\mathcal{V},\mathcal{E})$ is simply a listing $\sigma=(v_{1},\dots,v_{n})$ of all vertices in $\mathcal{V}$ .

Definition 12 (Elimination sets $U_{j}^{\sigma}$ of a vertex ordering $\sigma$ ).

Given a hypergraph $\mathcal{H}=(\mathcal{V},\mathcal{E})$ and a vertex ordering $\sigma=(v_{1},\dots,v_{n})$ , we define sets $U_{1}^{\sigma},\ldots,U_{n}^{\sigma}\subseteq\mathcal{V}$ , called the elimination sets of $\sigma$ , as follows: Let $\partial(v_{n})$ be the set of hyperedges of $\mathcal{H}$ that contain $v_{n}$ . We define $U_{n}^{\sigma}$ to be the union of all hyperedges in $\partial(v_{n})$ :

[TABLE]

If $n=1$ , then we are done. Otherwise, we remove vertex $v_{n}$ and all hyperedges in $\partial(v_{n})$ from $\mathcal{H}$ and add back to $\mathcal{H}$ a new hyperedge $U_{n}^{\sigma}-\{v_{n}\}$ , thus turning $\mathcal{H}$ into a hypergraph with $n-1$ vertices:

[TABLE]

The remaining elimination sets $U_{1}^{\sigma},\ldots,U_{n-1}^{\sigma}$ are defined inductively to be the elimination sets of the resulting hypergraph (whose vertices are now $\{v_{1},\ldots,v_{n-1}\}$ ).

When $\sigma$ is clear from the context, we drop the superscript $\sigma$ and use $U_{1},\ldots,U_{n}$ .

Proposition C.3 (Every vertex ordering has an “equivalent” tree decomposition [7]).

Given a hypergraph $\mathcal{H}=(\mathcal{V},\mathcal{E})$ , for every vertex ordering $\sigma$ , there is a tree decomposition $(T,\chi)$ whose bags $\chi(t)$ are the elimination sets $U_{j}^{\sigma}$ of $\sigma$ .

By applying the GYO elimination procedure [3] on the bags of any given tree decomposition, we can obtain an “equivalent” vertex ordering:

Proposition C.4 (Every tree decomposition has an “equivalent” vertex ordering [7]).

Given a hypergraph $\mathcal{H}=(\mathcal{V},\mathcal{E})$ , for every tree decomposition $(T,\chi)$ , there is a vertex ordering $\sigma$ such that every elimination set $U_{j}^{\sigma}$ of $\sigma$ is contained in some bag $\chi(t)$ of the tree decomposition $(T,\chi)$ .

FAQ-width of an FAQ query

Just like a conjunctive query, an FAQ query has a query hypergraph $\mathcal{H}=(\mathcal{V},\mathcal{E})$ . But unlike conjunctive queries, an FAQ query also specifies an order of its variables, which is the order in which we aggregate over those variables in the given FAQ-expression. (For example, in expression (73), we sum over $t$ first, then over $y$ , and we keep $z$ and $x$ as free variables. Hence, the FAQ query in (73) specifies the variable order $\sigma=(x,z,y,t)$ .) Such a variable order for the query can also be interpreted as a vertex order $\sigma$ for the query’s hypergraph.

The InsideOut algorithm for answering FAQ queries is based on variable elimination. To eliminate variable/vertex $v_{n}$ , we have to solve a sub-problem consisting of a smaller FAQ query over the variables in the elimination set $U_{n}^{\sigma}$ . This smaller query can be solved by an algorithm that is based on worst-case optimal join algorithms [69, 52, 53, 6]. From (74), this takes time 555To achieve this runtime, we need some additional ideas that are beyond the scope of this very brief introduction to FAQ. See [8] for more details.

[TABLE]

After eliminating $v_{n}$ , the remaining variables $v_{n-1},v_{n-2},\ldots,v_{1}$ can be eliminated similarly. This variable elimination algorithm motivates the following width notion.

Definition 13 (FAQ-width of a given variable ordering $\sigma$ ).

Given an FAQ query $\varphi$ with a variable ordering $\sigma$ , we define the FAQ-width of $\sigma$ , denoted by $\text{\sf faqw}(\sigma)$ , to be

[TABLE]

By the above definition, the FAQ-width of a variable ordering $\sigma$ is the same as the fractional hypertree width of the “equivalent” tree decomposition that is referred to in Proposition C.3.

Theorem C.5 (Runtime of $InsideOut$ [8]).

Given an FAQ-query $\varphi$ with a variable ordering $\sigma$ , the $InsideOut$ algorithm answers $\varphi$ in time

[TABLE]

where $|\varphi|$ is the output size in the listing representation.

Let $\varphi$ be an FAQ query with variable ordering $\sigma$ . In many cases, there might be a different variable ordering $\sigma^{\prime}$ such that if we were to permute the aggregates of $\varphi$ in the order of $\sigma^{\prime}$ instead of $\sigma$ , we would obtain an FAQ-query $\varphi^{\prime}$ that is “semantically-equivalent” to $\varphi$ (i.e. that always returns the same answer as $\varphi$ no matter what the input is). If this is the case, then we can run InsideOut on $\varphi$ using the ordering $\sigma^{\prime}$ instead of $\sigma$ , which can lead to a better runtime if $\text{\sf faqw}(\sigma^{\prime})$ happens to be smaller than $\text{\sf faqw}(\sigma)$ . We use $\text{\sf EVO}(\varphi)$ to denote the set of all such “equivalent” orderings $\sigma^{\prime}$ . (For a formal definition, see [8].) Therefore, it is best to consider all orderings $\sigma^{\prime}$ in $\text{\sf EVO}(\varphi)$ , pick the one with the smallest $\text{\sf faqw}(\sigma^{\prime})$ , and use it in InsideOut algorithm. This motivates the following definition.

Definition 14 (FAQ-width of an FAQ query).

The FAQ-width of an FAQ query $\varphi$ , denoted by $\text{\sf faqw}(\varphi)$ , is the minimum one over all orderings $\sigma^{\prime}$ in $\text{\sf EVO}(\varphi)$ , i.e.

[TABLE]

Characterizing $\text{\sf EVO}(\varphi)$ for an arbitrary given FAQ-query $\varphi$ is a technically involved problem (see [8] for hardness background and a general solution). However, the FAQ queries that we need for our machine learning tasks are of a special form that makes the problem easier.: The aggregate operator that we use in such queries is the summation operator $\sum$ . We refer to those restricted FAQ queries as FAQ-SS queries (see [8]). Our FAQ-SS queries in this work have only two types of variables:

•

Variables that we are summing over, e.g. variables $y$ and $t$ in (73).

•

Free variables (i.e. Group-by variables), e.g. variables $x$ and $z$ .

Given an FAQ-SS query $\varphi$ , $\text{\sf EVO}(\varphi)$ contains every ordering $\sigma^{\prime}$ that lists all free variables before the non-free variables. For example, for the FAQ-SS query $\varphi(x,z)$ in (73), $\text{\sf EVO}(\varphi(x,z))$ contains all permutations of $\{x,y,z,t\}$ where $\{x,z\}$ come before $\{y,t\}$ .

Proposition C.6.

For any FAQ-SS query $\varphi$ without free variables, we have $\text{\sf faqw}(\varphi)=\text{\sf fhtw}(\mathcal{H})$ , where $\mathcal{H}$ is the hypergraph of $\mathcal{H}$ .

Proof.

In this case, $\text{\sf EVO}(\varphi)$ contains all $n!$ possible orderings. By Proposition C.4, for every tree decomposition $(T,\chi)$ , there is an ordering $\sigma^{\prime}$ such that $\text{\sf faqw}(\sigma^{\prime})\leq\text{\sf fhtw}((T,\chi))$ . By Proposition C.3, for every ordering $\sigma^{\prime}$ , there is a tree decomposition $(T,\chi)$ such that $\text{\sf fhtw}((T,\chi))=\text{\sf faqw}(\sigma^{\prime})$ . Therefore, we have

[TABLE]

∎

Proposition C.7.

For any FAQ-SS query $\varphi$ with $f\geq 1$ free variables, we have $\text{\sf faqw}(\varphi)\leq\text{\sf fhtw}(\mathcal{H})+f-1$ , where $\mathcal{H}$ is the hypergraph of $\mathcal{H}$ .

Proof.

Find a tree decomposition $(T,\chi)$ of $\mathcal{H}$ with minimal fhtw, i.e. where $\text{\sf fhtw}((T,\chi))=\text{\sf fhtw}(\mathcal{H})$ . WLOG let the $f$ free variables be $v_{1},\ldots,v_{f}$ . Construct another tree decomposition $(T,\overline{\chi})$ by extending all bags $\chi(t)$ of $(T,\chi)$ with the variables $\{v_{2},\ldots,v_{f}\}$ , i.e. by defining $\overline{\chi}(t)=\chi(t)\cup\{v_{2},\ldots,v_{f}\}$ for all $t$ . By Definition 7, $(T,\overline{\chi})$ is indeed a tree decomposition. And because $\rho^{*}\left(\chi(t)\cup\{v_{2},\ldots,v_{f}\}\right)\leq\rho^{*}(\chi(t))+f-1$ , we have

[TABLE]

Moreover, since $(T,\chi)$ must have a bag $\chi(t^{*})$ that contains $v_{1}$ , the corresponding bag $\overline{\chi}(t^{*})$ of $(T,\overline{\chi})$ contains all the free variables $\{v_{1},\ldots,v_{f}\}$ . We designate $t^{*}$ as the root of $T$ , and then we run GYO elimination procedure [3] on the bags $\overline{\chi}(t)$ of $(T,\overline{\chi})$ to construct a vertex ordering $\sigma^{\prime}$ with $\text{\sf faqw}(\sigma^{\prime})\leq\text{\sf fhtw}((T,\overline{\chi}))$ . Moreover, if we choose to eliminate the vertices of the root $t^{*}$ at the end of GYO elimination (after all other vertices have already been eliminated), we can make the free variables $\{v_{1},\ldots,v_{f}\}$ appear before all other variables in $\sigma^{\prime}$ , thus making sure that $\sigma^{\prime}$ is indeed in $\text{\sf EVO}(\varphi)$ and completing the proof. In particular, we apply GYO elimination as follows:

•

If the tree $T$ contains only one node $t^{*}$ :

–

We eliminate vertices in $\overline{\chi}(t^{*})-\{v_{1},\ldots,v_{f}\}$ before eliminating $\{v_{1},\ldots,v_{f}\}$ .

–

We remove $t^{*}$ from $T$ , thus making $T$ an empty tree.

•

Otherwise, we pick a leaf node $t$ of $T$ (other than the root $t^{*}$ ). Let $t^{\prime}$ be the parent of $t$ in $T$ :

–

If $\overline{\chi}(t)\subseteq\overline{\chi}(t^{\prime})$ , then we remove node $t$ from $T$ along with the associated bag $\overline{\chi}(t)$ .

–

Otherwise, $\overline{\chi}(t)$ must have a vertex $u$ that is not in $\overline{\chi}(t^{\prime})$ . (Hence, by property (b) of Definition 7, $u$ is not in $\overline{\chi}(t^{\prime\prime})$ for all $t^{\prime\prime}$ in $T$ other than $t$ .)

If $u$ is the only vertex in $\overline{\chi}(t)$ , then we remove node $t$ from $T$ along with the associated bag $\overline{\chi}(t)$ .

*

Otherwise, we remove $u$ from $\overline{\chi}(t)$ .

•

We repeat the above steps until $T$ becomes an empty tree.

∎

Appendix D Missing details from Section 4

Proof of Theorem 4.1.

We start with point evaluation:

[TABLE]

The gradient formula follows straightforwardly from (16) and the chain rule. ∎

Proof of Corollary 21.

From (18) we have

[TABLE]

∎

Proof of Proposition 4.4.

For any event $E$ , let $\delta_{E}$ denote the Kronecker delta, i.e. $\delta_{E}=1$ if $E$ holds, and $\delta_{E}=0$ otherwise. Recall that the input query $Q$ has hypergraph $\mathcal{H}=(\mathcal{V},\mathcal{E})$ , and there is an input relation $R_{F}$ for every hyperedge $F\in\mathcal{E}$ . Recall that we can write $\vec{\sigma}_{ij}$ in the tensor form as shown in Eq. (23). Plugging in the definition of $h_{i}$ and $h_{j}$ from (9); and, let $C_{ij}=C_{i}\cup C_{j}$ and $V_{ij}=V_{i}\cup V_{j}$ , we have

[TABLE]

As illustrated in Example 13, the tensor $\bigotimes_{f\in C_{i}}\mathbf{x}_{f}\otimes\bigotimes_{f\in C_{j}}\mathbf{x}_{f}$ is very sparse. For a fixed tuple $\mathbf{x}$ , in fact, the tensor has only one $1$ entry, corresponding to the combination of values of the attributes in $C_{ij}$ . Hence, $\vec{\sigma}_{ij}$ is a function of the variables $C_{ij}$ . In the FAQ-framework, the query representing $\vec{\sigma}_{ij}$ can be expressed as a Sum-Product queries with free (i.e., group-by) variables $C_{ij}$ , defined by:

[TABLE]

Similarly, the tensor $\mathbf{c}_{j}$ can be sparsely represented by an aggregate query with group-by attributes $C_{j}$ , which is expressed as the Sum-Product query

[TABLE]

The overall runtimes for computing the above FAQ-queries follow from applying the InsideOut algorithm and Theorem C.5 [8]. ∎

Proof of Proposition 33.

The fact that $\text{\sf faqw}(i,j)\leq\text{\sf fhtw}+c-1$ follows from Proposition C.7. Since $\vec{\sigma}_{ij}$ is a tensor of order at most $c$ , and each attribute’s active domain has size at most $N$ , it follows that $|\vec{\sigma}_{ij}|\leq N^{c}$ . And, $|\vec{\sigma}_{ij}|\leq|D|$ because the support of the tensor $\vec{\sigma}_{ij}$ cannot be more than the output size.

Fix a query $Q$ with $\rho^{*}>\text{\sf fhtw}+c-1\geq c$ . Consider a database instance $I$ for which $|D|$ (the output size of $Q$ ) is $\Theta(N^{\rho^{*}})$ . (The existence of such database instances is guaranteed by Theorem C.2.) From this (33) follows trivially. ∎

Proof of Proposition 4.6.

We first analyze the time it takes to compute expression (16), which is dominated by the quadratic form $g(\vec{\theta})^{\top}\vec{\Sigma}g(\vec{\theta})$ . To compute this quadratic form, for every pair $i,j\in[m]$ we need to compute $g_{i}(\vec{\theta})^{\top}\vec{\sigma}_{ij}g_{j}(\vec{\theta})$ . This product is broken up into a sum of $t_{i}t_{j}$ terms when we expand $g_{i}$ and $g_{j}$ out. Each of those terms is computed in time $O(d_{i}d_{j}|\vec{\sigma}_{ij}|)$ . The runtime for computing (17) is analyzed similarly. ∎

Appendix E Missing details from Section 5

In the proofs below, for each feature $w\in V$ , $\mathbf{I}_{w}$ denote the identity matrix whose dimension is the size of the effective domain of $w$ . This is not to be confused with the notation $\mathbf{I}_{n}$ which is an order- $n$ identity matrix.

E.1 Missing rewriting steps in the example in Section 5.1

We first prove identity (42) formally. By setting the gradient in (41) to [math], we have $\vec{\theta}_{\textsf{country}}={(\mathbf{I}_{\textsf{country}}+\mathbf{R}\mathbf{R}^{\top})^{-1}\mathbf{R}}\vec{\gamma}_{\textsf{city}}$ . Hence, it remains to show the identity

[TABLE]

To see this, we apply the Sherman-Morrison-Woodbury identity (63) with $\mathbf{A}=\mathbf{I}_{\textsf{country}}$ and $\mathbf{U}=\mathbf{R}^{\top}$ to get

[TABLE]

multiply both sides of (81) on the right by $\mathbf{R}$ , we obtain

[TABLE]

We next show (43); to do so, it is sufficient to verify that

[TABLE]

For the sake of brevity, define ${\mathbf{B}=\mathbf{I}_{\textsf{city}}+\mathbf{R}^{\top}\mathbf{R}}$ so that $\vec{\theta}_{\textsf{country}}={\mathbf{R}\mathbf{B}^{-1}}\vec{\gamma}_{\textsf{city}}$ . We compute each term on the left hand side separately:

[TABLE]

From here, we derive (82) by

[TABLE]

E.2 Proof of Theorem 5.1

Proof.

We start by breaking the loss term into two parts

[TABLE]

and rewrite the second part:

[TABLE]

Equality (85) follows from (72). Equality (86) follows from (65). Equality at (88) is a bit loaded. What goes on there is that we broke the sum over $\mathbf{a}_{V}$ for which $\left\|\mathbf{a}_{V}\right\|_{1}\leq d$ and $\left\|\mathbf{a}_{G}\right\|_{1}>0$ into a nested triple sum. First of all, in order for $\left\|\mathbf{a}_{G}\right\|_{1}>0$ , obviously $\left\|\mathbf{a}_{\overline{G}}\right\|_{1}<d$ must hold, so we group by those tuples first. The remaining mass $\left\|\mathbf{a}_{G}\right\|_{1}$ can only be at most $d-\left\|\mathbf{a}_{\overline{G}}\right\|_{1}=d-q$ . Since all features in $G$ are categorical, from the above analysis we have $\mathbf{a}_{G}=(a_{g})_{g\in G}\in\{0,1\}^{G}$ , i.e., $\mathbf{a}_{G}$ is a characteristic vector of a subset $U\subseteq G$ . Let $T=\{i\ |\ U_{i}\neq\emptyset\}\subseteq[k]$ . Then, in the second summation we group $U$ by $T$ . The third summation ranges over all choices of $U\cap G_{i}$ , $i\in T$ , for which the total mass is at most $d-q$ . (Recall the definition of $\mathcal{U}(T,q)$ in (45).)

Next, in (89) we perform the reparameterization. Recall that $\mathbf{1}_{F_{T}|F}$ is the characteristic vector of the set $F_{T}=\{f_{i}\}_{i\in T}$ in the collection $F=\{f_{1},\dots,f_{k}\}$ . The new parameter $\vec{\gamma}_{(\mathbf{a}_{\overline{G}},\mathbf{1}_{F_{T}|F})}$ is indexed by the tuple $(\mathbf{a}_{\overline{G}},\mathbf{1}_{F_{T}|F})$ whose support is $\overline{G}\cup F=\overline{S}$ , i.e., the set of all features except for the ones functionally determined by features in $F$ . After the reparameterization, the loss term is identical to the loss term of a $\text{\sf PR}^{d}$ model whose features are $\overline{S}$ . This explains the collapsed pair $(\bar{g},\bar{h})$ used in the theorem.

Next, we explore the new parameter and how it affects the penalty term. Consider a fixed pair $\mathbf{a}_{\overline{G}}$ and $T\subseteq[k]$ such that $T\neq\emptyset$ and $\left\|\mathbf{a}_{\overline{G}}\right\|_{1}+|T|\leq d$ . The last condition is implicit for the set $U$ to exist for which $U\cap G_{i}\neq\emptyset$ and $\left\|\mathbf{a}_{\overline{G}}\right\|_{1}+|U|\leq d$ . Among all choices of $U$ , we single out $U=F_{T}$ and write

[TABLE]

Now we are ready to write the penalty term $\left\|\vec{\theta}\right\|_{2}^{2}$ in terms of the new parameter $\vec{\gamma}$ and some “left-over” components of $\vec{\theta}$ .

[TABLE]

Next, for every $W\in\mathcal{U}(T,q)-\{F_{T}\}$ , we optimize out the parameter $\vec{\theta}_{(\mathbf{a}_{\overline{G}},\mathbf{1}_{W|G})}$ by noting that the new loss term does not depend on these parameters. To optimize them out, we compute

[TABLE]

Setting this partial derivative to [math], we obtain $\vec{\theta}_{(\mathbf{a}_{\overline{G}},\mathbf{1}_{W|G})}=\mathbf{R}_{\mathbf{a}_{\overline{G}},W}\vec{\theta}_{(\mathbf{a}_{\overline{G}},\mathbf{1}_{F_{T}|G})}$ , which leads to

[TABLE]

Moving and grouping, we obtain

[TABLE]

Or, more compactly,

[TABLE]

Consequently, we can completely optimize out the remaining $\vec{\theta}$ -components, solving for them in terms of the components of $\vec{\gamma}$ :

[TABLE]

Thus, for a fixed $T$ and $\overline{G}$ , we can simplify the total squared normed involved:

[TABLE]

Finally, we write $\left\|\vec{\theta}\right\|_{2}^{2}$ in terms of the new parameter $\vec{\gamma}$ to prove (49):

[TABLE]

∎

E.3 Alternative to Corollary 21

One big advantage of a linear model in terms of BGD is Corollary 21, where we do not have to redo point-evaluation for every backtracking step. After the reparameterization exploiting FD-based dimensionality reduction, Corollary 21 does not work as is, because we have changed the penalty terms. However, it is easy to work out a similar result in terms of the new parameter space; see The point of the following proposition is that we only need to compute intermediate results involving the covariance matrix $\overline{\vec{\Sigma}}$ once while backtracking. For each new value of $\alpha$ , we will need to recompute the penalty’s objective $\overline{\Omega}(\vec{\gamma}-\alpha\mathbf{d})$ , which is an inexpensive operation. If $\lambda=0$ , we can even solve for $\alpha$ directly.

Proposition E.1.

With respect to the new parameters (and new objective $\overline{J}$ defined in (48)), the Armijo condition $\overline{J}(\vec{\gamma})-\overline{J}(\vec{\gamma}-\alpha\mathbf{d})\leq\frac{\alpha}{2}\left\|\mathbf{d}\right\|_{2}^{2}$ is equivalent to

[TABLE]

where $\mathbf{d}=\mbox{\boldmath$ \nabla $}\overline{J}(\vec{\gamma})$ . Furthermore, the next gradient of $\overline{J}$ is also readily available:

[TABLE]

Proof.

Let $\mathbf{d}=\mbox{\boldmath$ \nabla $}\overline{J}(\vec{\gamma})$ . Then,

[TABLE]

∎

E.4 Specializing Theorem 5.1 to the LR model

This section specializes Theorem 5.1 to the LR-model. Let us first specialize expressions (45), (46). and (47), We start with (45). Since $d=1$ , the only valid choice of $q$ is [math], and $|T|=1$ . If $T=\{j\}$ , then $U\in\mathcal{U}(T,q)$ iff $U=\{c\}$ for some $c\in G_{j}$ . In other words, we can replace $\mathcal{U}(T,q)$ by $G_{j}$ itself. Next, consider (46): there is only one valid choice of $\mathbf{\mathbf{missing}}a_{\overline{G}}$ – the all [math] vector – and $U=\{c\}$ for some $c\in G_{j}$ , the matrix $\mathbf{R}_{\mathbf{a}_{\overline{G}},U}$ is exactly $\mathbf{R}_{c}$ . Lastly, when $T=\{j\}$ the sum (47) becomes $\sum_{c\in G_{j}}\mathbf{R}_{c}^{\top}\mathbf{R}_{c}$ . We have the following corollary:

Corollary E.2.

Consider a LR model with parameters $\vec{\theta}=(\vec{\theta}_{w})_{w\in V}$ and $k$ groups of simple FDs $G_{i}=\{f_{i}\}\cup S_{i}$ , $i\in[k]$ . Define the following reparameterization:

[TABLE]

Then, minimizing $J(\vec{\theta})$ is equivalent to minimizing the function $\overline{J}(\vec{\gamma})=\frac{1}{2}\vec{\gamma}^{\top}\overline{\vec{\Sigma}}\vec{\gamma}-\left\langle\vec{\gamma},\overline{\mathbf{c}}\right\rangle+\frac{\lambda}{2}\Omega(\vec{\gamma}),$ where $\Omega(\vec{\gamma})=\sum_{w\in V\setminus G}\left\|\vec{\gamma}_{w}\right\|_{2}^{2}+\sum_{i=1}^{k}\left\langle\mathbf{B}_{i}^{-1}\gamma_{f_{i}},\gamma_{f_{i}}\right\rangle$ , and matrix $B_{i}$ for each $i\in[k]$ is given by

[TABLE]

$\overline{J}$ is defined with respect to the FD-reduced pair of functions $\overline{g},\overline{h}$ and a reduced parameter space of $\vec{\gamma}$ . Its gradient is very simple to compute, where we specialize (50):

[TABLE]

Moreover, once a minimizer $\vec{\gamma}$ of $\overline{J}$ is obtained, following (51), we can compute a minimizer $\vec{\theta}$ of $J$ by setting

[TABLE]

E.5 Specializing Theorem 5.1 to the $\text{\sf PR}^{2}$ model

In this section we explores Theorem 5.1 for the special case of degree-2 polynomial regression. This case is significant for three reasons. First, due to the explosion in the number of parameters, in practice one rarely runs polynomial regression of degree higher than $2$ . In fact, $\text{\sf PR}^{2}$ may be a sufficiently rich nonlinear regression model for many real-world applications. Second, this is technically already a highly non-trivial application of our general theorem. Third, this case shares some commonality with $\text{\sf FaMa}^{2}_{r}$ model to be described in the next section.

As before, we first specialize expressions (45), (46), and (47). To do so, we slightly change the indexing scheme of the model to simplify the presentation. In the general model, we use tuples $\mathbf{a}$ with $\left\|\mathbf{a}\right\|_{1}\leq d$ to index parameters. When the model is of degree $2$ , we explicitly write down the two types of indices: we use $\vec{\theta}_{w}$ , $w\in V$ instead of $\vec{\theta}_{\mathbf{a}}$ with $\left\|\mathbf{a}\right\|_{1}=1$ , and we use $\vec{\theta}_{cw}$ with $c,w\in V$ instead of $\vec{\theta}_{\mathbf{a}}$ when $\left\|\mathbf{a}\right\|_{1}=2$ .

We start with (45). Since $d=2$ , two valid choices of $q$ are [math] and $1$ .

•

when $q=1$ , $|T|=\{i\}$ for some $i\in[k]$ . The set $\mathcal{U}(\{i\},1)$ is the collection of singleton subsets of $G_{i}$ . Hence, this is similar to the linear regression situation.

•

when $q=0$ , $|T|$ is either $\{i\}$ or $\{i,j\}$ . The set $\mathcal{U}(\{i,j\},0)$ consists of all $2$ -subsets $U$ of $G$ for which $U$ contains one element from $G_{i}$ and one from $G_{j}$ . The set $\mathcal{U}(\{i\},0)$ contains all singletons and $2$ -subsets of $G_{i}$ .

Next, consider (46): there are two valid choices for the pair $(\mathbf{a}_{\overline{G}},U)$ :

•

when $\left\|\mathbf{a}_{\overline{G}}\right\|_{1}=0$ , $U\in\mathcal{U}(\{i,j\},0)$ or $U\in\mathcal{U}(\{i\},0)$ . In that case, we have

[TABLE]

•

when $\left\|\mathbf{a}_{\overline{G}}\right\|_{1}=1$ , $U\in\mathcal{U}(\{i\},1)$ for some $i\in[k]$ ; and in this case we use $w\in\overline{G}$ to represent $\mathbf{a}_{\overline{G}}$ :

[TABLE]

Finally, we write down (47) explicitly; the valid indices for $\mathbf{B}$ are $\mathbf{B}_{\emptyset,\{i\}}$ $\mathbf{B}_{\emptyset,\{i,j\}}$ , and $\mathbf{B}_{w,\{i\}}$ (also recall the definition of $\mathbf{B}_{i}$ in (93)):

[TABLE]

Corollary E.3.

Consider the $\text{\sf PR}^{2}$ model with $k$ groups of simple FDs $G_{i}=\{f_{i}\}\cup S_{i}$ , $i\in[k]$ . Let

[TABLE]

be the original parameters, and $G=\cup_{i\in[k]}G_{i}$ . Define the following reparameterization:

[TABLE]

Then, minimizing $J(\vec{\theta})$ is equivalent to minimizing the function $\overline{J}(\vec{\gamma})=\frac{1}{2}\vec{\gamma}^{\top}\overline{\vec{\Sigma}}\vec{\gamma}-\left\langle\vec{\gamma},\overline{\mathbf{c}}\right\rangle+\frac{\lambda}{2}\Omega(\vec{\gamma}),$ where

[TABLE]

The gradient of $\overline{J}$ is very simple to compute, by noticing that $\overline{J}$ is defined with respect to the FD-reduced pair of functions $\overline{g},\overline{h}$ and a reduced parameter space of $\vec{\gamma}$ . Its gradient can be computed by specializing (50):

[TABLE]

Moreover, once a minimizer $\vec{\gamma}$ of $\overline{J}$ is obtained, following (51), we can compute a minimizer $\vec{\theta}$ of $J$ by setting

[TABLE]

E.6 Proofs of results in Section 5.4

Proof of Theorem 5.2.

We begin with a similar derivation, where “relevant terms” of $\left\langle g(\vec{\theta}),h(\mathbf{x})\right\rangle$ are the terms where $h$ contains a feature $c\in F_{i}$ for some $i\in[k]$ :

[TABLE]

The above derivation immediately yields the reparameterization given in the statement of the theorem, which we reproduce here for the sake of clarity:

[TABLE]

Note that we did not define $\vec{\gamma}_{w}$ for $w\in S_{i}$ , $i\in[k]$ . The reason we can do so, is because we can optimize out $\vec{\theta}_{c}$ due to the following trick we have been using (as in the proof of Theorem 5.1). First, we rewrite all the terms in $\left\|\vec{\theta}\right\|_{2}^{2}$ in terms of $\vec{\gamma}$ and $\vec{\theta}_{c}$ , $c\in S_{i}$ , $i\in[k]$ :

[TABLE]

Since $\vec{\theta}_{t}$ , $t\in S_{i}$ , does not depend on the loss term, we have

[TABLE]

By setting (102) to [math], we have $\vec{\theta}_{t}=\mathbf{R}_{t}\vec{\theta}_{f_{i}}$ for all $t\in F_{i}$ , and thus

[TABLE]

which implies $\vec{\theta}_{f_{i}}=\mathbf{B}_{i}^{-1}(\vec{\gamma}_{f_{i}}-\vec{\beta}_{f_{i}})$ . Hence, the following always holds:

[TABLE]

Note also that,

[TABLE]

Due to the fact that $\vec{\theta}^{(\ell)}_{f_{i}}=\vec{\gamma}^{(\ell)}_{f_{i}}-\sum_{c\in S_{i}}\mathbf{R}^{\top}_{c}\vec{\gamma}^{(\ell)}_{c}$ , we can now write the penalty term in terms of the new parameter $\vec{\gamma}$ :

[TABLE]

∎

Proof of Proposition 5.3.

The goal is to derive the gradient of $\Omega(\vec{\gamma})$ w.r.t the parameters $\vec{\gamma}$ . Since $\vec{\beta}_{f_{i}}$ is a function of $\vec{\gamma}_{c}^{(\ell)}$ , $\ell\in[r]$ , $c\in F_{i}$ , the following is immediate:

[TABLE]

Next, we have to simplify $\vec{\beta}_{f_{i}}$ to facilitate fast computation:

[TABLE]

Next, we derive the partial derivative w.r.t. $\vec{\gamma}_{f_{i}}^{(\ell)}$ for a fixed $i\in[k]$ , $\ell\in[r]$ ; in this computation we make use of (5) above:

[TABLE]

Lastly, we move on to the partial derivative w.r.t. $\vec{\gamma}_{w}^{(\ell)}$ for a fixed $i\in[k]$ , $w\in S_{i}$ , $\ell\in[r]$ :

[TABLE]

In particular, we were able to reuse the computation of $\frac{1}{2}\frac{\partial\left\|\vec{\theta}\right\|_{2}^{2}}{\partial\vec{\gamma}_{f_{i}}^{(\ell)}}$ and $\frac{1}{2}\frac{\partial\left\|\vec{\theta}\right\|_{2}^{2}}{\partial\vec{\gamma}_{f_{i}}}$ to compute $\frac{1}{2}\frac{\partial\left\|\vec{\theta}\right\|_{2}^{2}}{\partial\vec{\gamma}_{w}^{(\ell)}}$ . There is, however, still one complicated term $\vec{\beta}_{f_{i}}$ left to compute. We simplify $\vec{\beta}_{f_{i}}$ to make its evaluation faster as follows.

[TABLE]

This completes the proof. ∎

Bibliography71

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. A. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. Tensorflow: A system for large-scale machine learning. In OSDI , pages 265–283, 2016.
2[2] S. Abiteboul, M. Arenas, P. Barceló, M. Bienvenu, D. Calvanese, C. David, R. Hull, E. Hüllermeier, B. Kimelfeld, L. Libkin, W. Martens, T. Milo, F. Murlak, F. Neven, M. Ortiz, T. Schwentick, J. Stoyanovich, J. Su, D. Suciu, V. Vianu, and K. Yi. Research directions for principles of data management (dagstuhl perspectives workshop 16151). Dagstuhl Manifestos , 7(1):1–29, 2018.
3[3] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases . Addison-Wesley, 1995.
4[4] M. Abo Khamis, H. Q. Ngo, X. Nguyen, D. Olteanu, and M. Schleich. AC/DC: In-database learning thunderstruck. In Data Management for End-To-End Machine Learning , DEEM’18, pages 8:1–8:10, 2018.
5[5] M. Abo Khamis, H. Q. Ngo, X. Nguyen, D. Olteanu, and M. Schleich. In-database learning with sparse tensors. In PODS , pages 325–340, 2018.
6[6] M. Abo Khamis, H. Q. Ngo, C. Ré, and A. Rudra. Joins via geometric resolutions: Worst-case and beyond. In PODS , pages 213–228, 2015.
7[7] M. Abo Khamis, H. Q. Ngo, and A. Rudra. FAQ: questions asked frequently. Co RR , abs/1504.04044, 2015.
8[8] M. Abo Khamis, H. Q. Ngo, and A. Rudra. FAQ: Questions asked frequently. In PODS , pages 13–28, 2016.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Learning Models over Relational Data using Sparse Tensors and Functional Dependencies

Abstract

1 Introduction

Contributions

Organization

2 Preliminaries

2.1 Feature Extraction Query

Example 1**.**

2.2 Matrix calculus

Basics

The Product Cookbook: Tensor product, Kronecker product, and Khatri-Rao product

Definition 1** (Tensor product).**

Definition 2** (Khatri-Rao product).**

Definition 3** (Hadamard product).**

3 Problem formulation

3.1 Continuous features

Example 2**.**

Example 3**.**

Example 4**.**

Example 5**.**

Example 6**.**

3.2 Categorical features

3.3 Tensor product representation

Example 7**.**

Example 8**.**

Example 9**.**

Example 10**.**

4 Database-centric problem reformulation

4.1 Solution for square loss problems

Continuous features

Theorem 4.1**.**

Example 11**.**

Corollary 4.2**.**

Categorical features via sparse tensors

Example 12**.**

Theorem 4.3**.**

Example 13**.**

4.2 Solution for Principal Component Analysis

Continuous features

Categorical features via sparse tensors

Example 14**.**

4.3 Efficient computation of the sparse tensor representation

Proposition 4.4**.**

Proposition 4.5**.**

4.4 Point evaluation and gradient computation

Proposition 4.6**.**

4.5 Diagram of our structure-aware approach revisited

5 FD-Aware Optimization

5.1 Introduction to the main ideas

5.2 Functional dependencies (FDs)

Definition 4**.**

Example 15**.**

Definition 5** (FD-reduced pairs of functions).**

5.3 Polynomial regression under FDs

Theorem 5.1**.**

5.4 Factorization machines under FDs

Theorem 5.2**.**

Proposition 5.3**.**

5.5 Principal Component Analysis under FDs

Lemma 5.4**.**

Proof.

5.6 Linear algebra with database queries

Maintaining the matrix inverse with rank-111 updates

Maintaining a Cholesky decomposition with rank-kkk update

5.7 Discussion

Diagram of our structure-aware learning in the presence of FDs

Impact of FDs on model complexity

Impact of FDs on computational complexity

6 The Design and Implementation of AC/DC

6.1 Factorized aggregate computation

Definition 1** (adapted from [56]).**

Example 16**.**

Example 17**.**

6.2 Shared computation of aggregates

Example 1.

Definition 1 (Tensor product).

Definition 2 (Khatri-Rao product).

Definition 3 (Hadamard product).

Example 2.

Example 3.

Example 4.

Example 5.

Example 6.

Example 7.

Example 8.

Example 9.

Example 10.

Theorem 4.1.

Example 11.

Corollary 4.2.

Example 12.

Theorem 4.3.

Example 13.

Example 14.

Proposition 4.4.

Proposition 4.5.

Proposition 4.6.

Definition 4.

Example 15.

Definition 5 (FD-reduced pairs of functions).

Theorem 5.1.

Theorem 5.2.

Proposition 5.3.

Lemma 5.4.

Maintaining the matrix inverse with rank- $1$ updates

Maintaining a Cholesky decomposition with rank- $k$ update

Definition 1 (adapted from [56]).

Example 16.

Example 17.

Example 18.

Example 19.

Proposition A.1.

Proposition A.2.

Definition 6 (Fractional edge cover number $\rho^{*}$ ).

Theorem C.1 (AGM-bound [12, 33]).

Theorem C.2 (AGM-bound is tight [12, 33]).

Definition 7 (Tree decomposition).

Definition 8 (acyclicity).

Definition 9 ( $g$ -width of a hypergraph: a generic width notion [9]).

Definition 10 (Treewidth and fractional hypertree width are special cases of $g$ -width).

Definition 11 (Vertex ordering of a hypergraph).

Definition 12 (Elimination sets $U_{j}^{\sigma}$ of a vertex ordering $\sigma$ ).

Proposition C.3 (Every vertex ordering has an “equivalent” tree decomposition [7]).

Proposition C.4 (Every tree decomposition has an “equivalent” vertex ordering [7]).

Definition 13 (FAQ-width of a given variable ordering $\sigma$ ).

Theorem C.5 (Runtime of $InsideOut$ [8]).

Definition 14 (FAQ-width of an FAQ query).

Proposition C.6.

Proposition C.7.

Proposition E.1.

Corollary E.2.

E.5 Specializing Theorem 5.1 to the $\text{\sf PR}^{2}$ model

Corollary E.3.