Probabilistic Models for Query Approximation with Large Sparse Binary Datasets
Dmitry Y. Pavlov, Heikki Mannila, Padhraic Smyth

TL;DR
This paper studies probabilistic models, in particular Markov random fields (MRFs), for estimating query selectivity in large sparse binary datasets. MRFs give more accurate estimates than the independence and Chow-Liu tree models but cost more computation and memory, and the authors propose ways to reduce that cost.
Contribution
It studies a Markov random field approach to query approximation based on frequent sets and maximum entropy, compares it with the independence model and the Chow-Liu tree model, and proposes techniques to alleviate its computational requirements.
Findings
MRF models give substantially more accurate probability estimates than the independence and Chow-Liu tree models
Optimization techniques reduce computational costs
Experimental validation on real-world datasets
Abstract
Large sparse sets of binary transaction data with millions of records and thousands of attributes occur in various domains: customers purchasing products, users visiting web pages, and documents containing words are just three typical examples. Real-time query selectivity estimation (the problem of estimating the number of rows in the data satisfying a given predicate) is an important practical problem for such databases. We investigate the application of probabilistic models to this problem. In particular, we study a Markov random field (MRF) approach based on frequent sets and maximum entropy, and compare it to the independence model and the Chow-Liu tree model. We find that the MRF model provides substantially more accurate probability estimates than the other methods but is more expensive from a computational and memory viewpoint. To alleviate the computational requirements we…
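As an illustration of the simplest baseline named in the abstract, the sketch below estimates the selectivity of a conjunctive query under the independence model: each attribute's marginal frequency is multiplied together and scaled by the number of rows. This is a minimal sketch, not the authors' code; the function names `independence_selectivity` and `exact_selectivity`, the dict-based query format, and the toy data are all assumptions made for the example.

```python
def independence_selectivity(rows, query):
    """Estimate how many rows match a conjunctive query, assuming
    all attributes are statistically independent.

    rows:  list of equal-length 0/1 tuples (one tuple per transaction).
    query: dict mapping column index -> required value (0 or 1).
    (Both formats are hypothetical, chosen for this sketch.)
    """
    n = len(rows)
    prob = 1.0
    for col, value in query.items():
        # Marginal frequency of a 1 in this column.
        p1 = sum(row[col] for row in rows) / n
        prob *= p1 if value == 1 else 1.0 - p1
    return n * prob


def exact_selectivity(rows, query):
    """Count matching rows by scanning the data (the ground truth)."""
    return sum(
        all(row[col] == value for col, value in query.items())
        for row in rows
    )


# Toy data in which columns 0 and 1 always co-occur.
rows = [
    (1, 1, 0),
    (1, 1, 0),
    (0, 0, 1),
    (0, 0, 0),
]
query = {0: 1, 1: 1}  # rows where attribute 0 = 1 AND attribute 1 = 1

estimate = independence_selectivity(rows, query)
exact = exact_selectivity(rows, query)
print(estimate, exact)
```

On this toy data the two columns are perfectly correlated, so the independence estimate falls short of the exact count. Capturing such dependencies is the motivation for the richer models the abstract compares against, the Chow-Liu tree and the MRF.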
Taxonomy
Topics: Bayesian Modeling and Causal Inference · Data Management and Algorithms · Data Mining Algorithms and Applications
