# Learning Models over Relational Data using Sparse Tensors and Functional Dependencies

**Authors:** Mahmoud Abo Khamis, Hung Q. Ngo, XuanLong Nguyen, Dan Olteanu, and Maximilian Schleich

arXiv: 1703.04780 · 2020-02-10

## TL;DR

This paper presents a unified framework and system for training statistical models directly over relational databases. It uses sparse tensors and functional dependencies to make training faster without losing accuracy.

## Contribution

The paper introduces a theoretical framework that combines tools from database theory and linear algebra for structure-aware learning. It implements this framework in the AC/DC system and benchmarks it against R, MADlib, libFM, and TensorFlow.

## Key findings

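- AC/DC learns polynomial regression models and factorization machines with at least the same accuracy as its competitors.
- AC/DC is up to three orders of magnitude (about 1000 times) faster on typical retail forecasting and advertisement planning applications.
- On these workloads, competitors can run out of memory, exceed the 24-hour timeout, or hit internal design limitations.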

## Abstract

Integrated solutions for analytics over relational databases are of great practical importance as they avoid the costly repeated loop data scientists have to deal with on a daily basis: select features from data residing in relational databases using feature extraction queries involving joins, projections, and aggregations; export the training dataset defined by such queries; convert this dataset into the format of an external learning tool; and train the desired model using this tool. These integrated solutions are also a fertile ground of theoretically fundamental and challenging problems at the intersection of relational and statistical data models.

This article introduces a unified framework for training and evaluating a class of statistical learning models over relational databases. This class includes ridge linear regression, polynomial regression, factorization machines, and principal component analysis. We show that, by synergizing key tools from database theory such as schema information, query structure, functional dependencies, recent advances in query evaluation algorithms, and from linear algebra such as tensor and matrix operations, one can formulate relational analytics problems and design efficient (query and data) structure-aware algorithms to solve them.

This theoretical development informed the design and implementation of the AC/DC system for structure-aware learning. We benchmark the performance of AC/DC against R, MADlib, libFM, and TensorFlow. For typical retail forecasting and advertisement planning applications, AC/DC can learn polynomial regression models and factorization machines with at least the same accuracy as its competitors and up to three orders of magnitude faster than its competitors whenever they do not run out of memory, exceed 24-hour timeout, or encounter internal design limitations.
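
One way to see why query structure can help: a least-squares model depends on the training data only through a few sums, and those sums can be computed by aggregating each relation before joining rather than after. The sketch below is a toy illustration in plain Python, not AC/DC's implementation. The relations `sales` and `stores`, the single feature `size`, and the helper names are all hypothetical, and the one-parameter ridge fit (no intercept) is a simplification chosen to keep the example short.

```python
from collections import defaultdict

# Hypothetical toy relations (not taken from the paper).
sales = [  # Sales(store, units): units is the label y
    ("s1", 10.0), ("s1", 12.0),
    ("s2", 30.0), ("s2", 28.0), ("s2", 32.0),
]
stores = {"s1": 1.0, "s2": 3.0}  # Stores(store, size): size is the feature x


def stats_materialized(sales, stores):
    """Materialize the join Sales |x| Stores, then aggregate over its rows."""
    rows = [(stores[s], y) for s, y in sales if s in stores]
    n = len(rows)
    sxx = sum(x * x for x, _ in rows)
    sxy = sum(x * y for x, y in rows)
    return n, sxx, sxy


def stats_pushed_down(sales, stores):
    """Aggregate Sales per store first, then combine with Stores.

    The join result is never built: each store contributes
    count * x^2 to sum(x^2) and x * sum(y) to sum(x*y).
    """
    count = defaultdict(int)
    sum_y = defaultdict(float)
    for s, y in sales:
        count[s] += 1
        sum_y[s] += y

    n, sxx, sxy = 0, 0.0, 0.0
    for s, x in stores.items():
        c = count.get(s, 0)
        n += c
        sxx += c * x * x
        sxy += x * sum_y.get(s, 0.0)
    return n, sxx, sxy


def ridge_1d(n, sxx, sxy, lam):
    """Minimize (1/2n) * sum((theta*x - y)^2) + (lam/2) * theta^2 in closed form."""
    return (sxy / n) / (sxx / n + lam)


if __name__ == "__main__":
    flat = stats_materialized(sales, stores)
    pushed = stats_pushed_down(sales, stores)
    print("materialized:", flat)
    print("pushed down: ", pushed)
    print("theta:", ridge_1d(*pushed, lam=0.1))
```

Both functions return the same statistics, but the second touches each input tuple once and never enumerates the join result, which matters when the join is much larger than its input relations.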

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1703.04780/full.md

## Figures

13 figures with captions in the complete paper: https://tomesphere.com/paper/1703.04780/full.md

## References

71 references — full list in the complete paper: https://tomesphere.com/paper/1703.04780/full.md

---
Source: https://tomesphere.com/paper/1703.04780