A Closer Look at Variance Implementations in Modern Database Systems
Niranjan Kamat, Arnab Nandi

TL;DR
This paper examines how modern database systems implement variance calculations, highlighting issues like precision loss and non-distributive methods, and reviews best practices for accurate and efficient variance computation in various contexts.
Contribution
It provides a comprehensive analysis of variance implementations in real-world systems, identifying their limitations and offering recommendations based on historical and recent research.
Findings
PostgreSQL 9.4 uses a representation prone to precision loss
Most commercial systems use efficient but potentially inaccurate methods
Literature review offers best practices for variance computation
Abstract
Variance is a popular and often necessary component of sampled aggregation queries. It is typically used as a secondary measure to ascertain statistical properties of the result such as its error. Yet, it is more expensive to compute than simple, primary measures such as SUM, MEAN, and COUNT. There exist numerous techniques to compute variance. While the definition of variance is considered to require multiple passes on the data, other mathematical representations can compute the value in a single pass. Some single-pass representations, however, can suffer from severe precision loss, especially for a large number of data points. In this paper, we study variance implementations in various real-world systems and find that major database systems such as PostgreSQL 9.4 and most likely System X, a major commercially used closed-source database, use a…
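The precision issue mentioned in the abstract can be illustrated with a small sketch. The abstract excerpt is truncated before it names the representation in question, so the following is an assumption for illustration only: it contrasts the textbook single-pass sum-of-squares formula with Welford's single-pass update and with the two-pass definition. The function names and the sample data are hypothetical and are not taken from the paper or from any database system.

```python
# Illustrative sketch only: names and data are hypothetical, not from the paper.

def variance_sum_of_squares(xs):
    """Single pass, textbook formula: (sum(x^2) - (sum(x))^2 / n) / (n - 1).
    Subtracting two large, nearly equal numbers can cancel most significant digits."""
    n = 0
    s = 0.0
    sq = 0.0
    for x in xs:
        n += 1
        s += x
        sq += x * x
    return (sq - s * s / n) / (n - 1)


def variance_welford(xs):
    """Single pass, Welford's update: tracks the running mean and the sum of
    squared deviations from it, avoiding the large intermediate sums."""
    n = 0
    mean = 0.0
    m2 = 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / (n - 1)


def variance_two_pass(xs):
    """Two passes, straight from the definition: mean first, then deviations."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


# Small spread around a large offset; the exact sample variance is 30.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]

print("sum of squares:", variance_sum_of_squares(data))
print("Welford:       ", variance_welford(data))
print("two pass:      ", variance_two_pass(data))
```

With 64-bit floats, the squares of these values are near 1e18 and can no longer be stored exactly, so the sum-of-squares result drifts far from 30, while the other two functions stay close to it. Which representation each database system actually uses is the subject of the paper itself and is not asserted here.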
Taxonomy
Topics: Data Management and Algorithms · Advanced Database Systems and Queries · Bayesian Modeling and Causal Inference
