Fast generalised linear models by database sampling and one-step polishing

Thomas Lumley

arXiv:1803.05165·stat.CO·March 15, 2018

TL;DR

This paper introduces a method for fitting generalised linear models directly from large relational databases using one sampling query and a one-step refinement, yielding an estimator that is fully efficient and asymptotically equivalent to the maximum likelihood estimator.

Contribution

It presents a novel approach combining database sampling with a one-step polishing technique to fit generalized linear models efficiently on large datasets.

Findings

01 The estimator is fully efficient and asymptotically equivalent to the maximum likelihood estimator (MLE).

02 The method needs only two database queries: one sampling query and one aggregation query.

03 A proof-of-concept implementation is demonstrated on New York City taxi-trip data and New Zealand car-colour data.
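
To make the "two queries" finding concrete, here is a minimal sketch of what the sampling query and the aggregation query might look like for a logistic model with one covariate. It uses Python's `sqlite3` module rather than the paper's R implementation; the table `obs`, its columns, the simulated data, the registered `expit` function, and the starting value `(b0, b1)` are all assumptions made for illustration.

```python
import math
import random
import sqlite3

random.seed(1)

# Hypothetical table of N observations with one covariate x and a binary outcome y.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (x REAL, y INTEGER)")
rows = []
for _ in range(5000):
    x = random.gauss(0.0, 1.0)
    p = 1.0 / (1.0 + math.exp(-(-0.5 + 0.8 * x)))
    rows.append((x, 1 if random.random() < p else 0))
conn.executemany("INSERT INTO obs VALUES (?, ?)", rows)

# SQLite builds do not always ship math functions, so register the
# inverse logit from Python (an assumption of this sketch).
conn.create_function("expit", 1, lambda eta: 1.0 / (1.0 + math.exp(-eta)))

# Query 1 (sampling): pull a simple random sample into memory.
n = 200
sample = conn.execute(
    "SELECT x, y FROM obs ORDER BY RANDOM() LIMIT ?", (n,)
).fetchall()

# Query 2 (aggregation): evaluate the full-data score at a starting value.
# In practice the starting value would come from a model fitted to `sample`;
# here it is a fixed, hypothetical pair.
b0, b1 = -0.4, 0.7
score = conn.execute(
    """
    SELECT SUM(y - expit(:b0 + :b1 * x)),
           SUM(x * (y - expit(:b0 + :b1 * x)))
    FROM obs
    """,
    {"b0": b0, "b1": b1},
).fetchone()

print(len(sample), score)
```

Only the sample and a handful of sums ever leave the database, which is the point of the design: the full table is scanned inside the database engine and never loaded into memory.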

Abstract

In this note, I show how to fit a generalised linear model to $N$ observations on $p$ variables stored in a relational database, using one sampling query and one aggregation query, as long as $N^{\frac{1}{2} + \delta}$ observations can be stored in memory. The resulting estimator is fully efficient and asymptotically equivalent to the maximum likelihood estimator, and so its variance can be estimated from the Fisher information in the usual way. A proof-of-concept implementation uses R with MonetDB and with SQLite, and could easily be adapted to other popular databases. I illustrate the approach with examples of taxi-trip data in New York City and factors related to car colour in New Zealand.
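
The idea in the abstract can be illustrated with a small in-memory simulation: fit the model to a subsample of size about $N^{\frac{1}{2} + \delta}$, then take a single Newton (Fisher scoring) step using sums over all $N$ observations. This is a sketch under stated assumptions, not the paper's implementation: it is written in Python rather than R, uses a logistic model with one covariate, keeps the "database" as a Python list, and computes both the score and the information from the full data, which may differ from the exact update used in the paper. All function and variable names are hypothetical.

```python
import math
import random


def expit(eta):
    return 1.0 / (1.0 + math.exp(-eta))


def score_and_information(data, b0, b1):
    """Score and Fisher information for logistic regression with one covariate.

    Every quantity is a plain sum over rows, so in a database setting
    it could be produced by a single aggregation query.
    """
    u0 = u1 = i00 = i01 = i11 = 0.0
    for x, y in data:
        mu = expit(b0 + b1 * x)
        w = mu * (1.0 - mu)
        u0 += y - mu
        u1 += x * (y - mu)
        i00 += w
        i01 += w * x
        i11 += w * x * x
    return (u0, u1), (i00, i01, i11)


def newton_step(data, b0, b1):
    """One Newton / Fisher scoring update: beta + I(beta)^{-1} U(beta)."""
    (u0, u1), (i00, i01, i11) = score_and_information(data, b0, b1)
    det = i00 * i11 - i01 * i01
    # Solve the 2x2 system I * step = U by Cramer's rule.
    return (b0 + (i11 * u0 - i01 * u1) / det,
            b1 + (i00 * u1 - i01 * u0) / det)


def fit(data, n_iter=20):
    """Iterate Newton steps from zero; used for the sample and as a reference."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        b0, b1 = newton_step(data, b0, b1)
    return b0, b1


random.seed(0)
N, delta = 50_000, 0.1

# Simulated stand-in for the full table.
full = []
for _ in range(N):
    x = random.gauss(0.0, 1.0)
    y = 1.0 if random.random() < expit(-0.5 + 0.8 * x) else 0.0
    full.append((x, y))

n = int(N ** (0.5 + delta))                      # sample size N^(1/2 + delta)
sample = random.sample(full, n)                  # stands in for the sampling query

beta_sample = fit(sample)                        # in-memory fit on the sample
beta_one_step = newton_step(full, *beta_sample)  # one pass over all N rows
beta_full = fit(full)                            # full MLE, for comparison only

print("sample fit:   ", beta_sample)
print("one-step fit: ", beta_one_step)
print("full-data MLE:", beta_full)
```

The comparison with `beta_full` is there only to check the sketch; the method itself never needs the iterative full-data fit, which is exactly what it is designed to avoid.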

Code & Models

Repositories

tslumley/dbglm (Official)

Taxonomy

Topics: Bayesian Modeling and Causal Inference · Statistical Methods and Inference · Data Management and Algorithms