Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets
A. Moore, M. S. Lee

TL;DR
This paper presents the ADtree, a sparse data structure that caches counts from a large dataset so that contingency tables and conjunctive-query counts can be computed without rescanning the records, which speeds up machine learning algorithms that depend on such counts.
Contribution
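The paper introduces the ADtree together with counting algorithms whose cost, under certain assumptions, is independent of the number of records and loglinear in the number of non-zero contingency-table entries. It also gives analytical worst-case bounds for the structure under several models of data distribution.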
Findings
ADtrees enable fast, memory-efficient counting in large datasets.
Empirical results show that tractably sized ADtrees can be built for large real-world datasets and that they outperform traditional counting methods.
ADtrees accelerate algorithms like Bayes net learning and feature selection.
Abstract
This paper introduces new algorithms and data structures for quick counting for machine learning datasets. We focus on the counting task of constructing contingency tables, but our approach is also applicable to counting the number of records in a dataset that match conjunctive queries. Subject to certain assumptions, the costs of these operations can be shown to be independent of the number of records in the dataset and loglinear in the number of non-zero entries in the contingency table. We provide a very sparse data structure, the ADtree, to minimize memory use. We provide analytical worst-case bounds for this structure for several models of data distribution. We empirically demonstrate that tractably-sized data structures can be produced for large real-world datasets by (a) using a sparse tree structure that never allocates memory for counts of zero, (b) never allocating memory for…
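To make the counting task concrete, here is a minimal Python sketch of a cached-count tree in the spirit of point (a) above: nodes are created only for attribute values that actually occur, so zero counts take no memory. It is a simplified illustration, not the paper's ADtree. It leaves out the further memory savings the abstract goes on to describe, and the names `CountNode`, `count_matching`, and `contingency_table` are hypothetical. Records are assumed to be tuples of discrete values indexed by attribute position.

```python
class CountNode:
    """Caches the number of records matching one conjunctive query."""

    def __init__(self, records, start_attr, n_attrs):
        self.count = len(records)
        # children[(attr, value)] refines this node's query with attr == value.
        # Only attributes with a higher index are expanded, so each
        # conjunction is stored once, in ascending attribute order.
        self.children = {}
        for attr in range(start_attr, n_attrs):
            groups = {}
            for rec in records:
                groups.setdefault(rec[attr], []).append(rec)
            for value, subset in groups.items():
                # Values that never occur get no node: zero counts cost nothing.
                self.children[(attr, value)] = CountNode(subset, attr + 1, n_attrs)


def count_matching(root, query):
    """Count records matching query, a dict {attr_index: value}."""
    node = root
    for attr in sorted(query):
        node = node.children.get((attr, query[attr]))
        if node is None:
            return 0  # a missing node means a count of zero
    return node.count


def contingency_table(node, attrs):
    """Non-zero cells {value_tuple: count} for attrs, given in ascending order."""
    if not attrs:
        return {(): node.count}
    first, rest = attrs[0], attrs[1:]
    table = {}
    for (attr, value), child in node.children.items():
        if attr == first:
            for key, cell in contingency_table(child, rest).items():
                table[(value,) + key] = cell
    return table


# Toy dataset: five records over three binary attributes.
records = [(0, 1, 0), (0, 1, 1), (1, 0, 1), (0, 1, 1), (1, 1, 0)]
root = CountNode(records, 0, 3)

two_way = contingency_table(root, [0, 1])
matches = count_matching(root, {0: 0, 2: 1})
```

Once the tree is built, a query walks at most one node per queried attribute and never touches the records again, which is the sense in which counting cost can be independent of dataset size. This naive version still expands every ascending attribute combination that occurs in the data, so it is only practical for a handful of attributes.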
Taxonomy
Topics: Data Management and Algorithms · Data Quality and Management · Bayesian Modeling and Causal Inference
