Computing Multi-Relational Sufficient Statistics for Large Databases

Zhensong Qian; Oliver Schulte; Yan Sun

arXiv:1408.5389 · cs.LG · October 23, 2014

TL;DR

This paper introduces a scalable dynamic programming algorithm for computing multi-relational sufficient statistics, including negative relationships, in large databases without materializing join tables, enabling advanced statistical analysis.

Contribution

It presents a novel Möbius virtual join algorithm that efficiently computes counts involving positive and negative relationships in large, complex databases.

Findings

01 Scales to datasets over 1 million tuples

02 Improves feature selection and rule mining

03 Enables Bayesian network learning with relational data

Abstract

Databases contain information about which relationships do and do not hold among entities. Making this information accessible for statistical analysis requires computing sufficient statistics that combine information from different database tables. Such statistics may involve any number of positive and negative relationships. With a naive enumeration approach, computing sufficient statistics for negative relationships is feasible only for small databases. We solve this problem with a new dynamic programming algorithm that performs a virtual join, where the requisite counts are computed without materializing join tables. Contingency table algebra is a new extension of relational algebra that facilitates the efficient implementation of this Möbius virtual join operation. The Möbius Join scales to large datasets (over 1M tuples) with complex schemas. Empirical evaluation with…
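To make the idea of counting negative relationships without enumerating every entity pair concrete, here is a minimal sketch for the simplest case of a single relationship. It is not the paper's algorithm: the function names, the toy student/course schema, and the attribute values are all hypothetical, and the sketch only illustrates the subtraction step (pairs where the relationship is false = all pairs minus pairs where it is true). Handling several relationships at once, as the abstract describes, would need the more general dynamic program.

```python
from collections import Counter
from itertools import product


def contingency_counts(students, courses, registered):
    """Count (intelligence, difficulty, registered?) combinations.

    students:   dict student_id -> intelligence value (hypothetical attribute)
    courses:    dict course_id -> difficulty value (hypothetical attribute)
    registered: set of (student_id, course_id) pairs where the relationship holds
    """
    # Positive counts: one pass over the tuples that actually exist.
    positive = Counter((students[s], courses[c]) for s, c in registered)

    # Counts that ignore the relationship: the product of the two entity
    # marginals, computed without building the student x course cross product.
    student_marginal = Counter(students.values())
    course_marginal = Counter(courses.values())

    table = {}
    for (i, n_i), (d, n_d) in product(student_marginal.items(),
                                      course_marginal.items()):
        n_true = positive.get((i, d), 0)
        table[(i, d, True)] = n_true
        # Negative count by subtraction: all pairs minus pairs where it holds.
        table[(i, d, False)] = n_i * n_d - n_true
    return table


def brute_force_counts(students, courses, registered):
    """Naive enumeration of every student-course pair, for comparison only."""
    table = Counter()
    for s, c in product(students, courses):
        table[(students[s], courses[c], (s, c) in registered)] += 1
    return table


# Toy data, invented for illustration.
students = {"s1": "high", "s2": "low", "s3": "high"}
courses = {"c1": "hard", "c2": "easy"}
registered = {("s1", "c1"), ("s2", "c2"), ("s3", "c1")}

fast = contingency_counts(students, courses, registered)
slow = brute_force_counts(students, courses, registered)
assert all(fast[key] == slow.get(key, 0) for key in fast)
```

The subtraction version touches only the existing relationship tuples and the distinct attribute values, whereas the brute-force version visits every student-course pair, which is the cost that becomes prohibitive on large databases.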
