A Stream-Suitable Kolmogorov-Smirnov-Type Test for Big Data Analysis
Hien Duy Nguyen

TL;DR
This paper introduces the CAKS goodness-of-fit test, a new stream-compatible and scalable adaptation of the Kolmogorov-Smirnov test for big data, capable of handling extremely large sample sizes efficiently.
Contribution
The paper develops the CAKS test, a novel KS-type test designed for streaming and large-scale data, with proven asymptotic normality and consistency against various alternatives.
Findings
The CAKS test is faster than the traditional KS test for large samples.
It detects deviations from the null distribution in mean, variance, and shape.
The test is applicable to sample sizes of 10^9 and beyond.
Abstract
Big Data has become an ever more commonplace setting that is encountered by data analysts. In the Big Data setting, analysts are faced with very large numbers of observations as well as data that arrive as a stream, both of which are phenomena that many traditional statistical techniques are unable to contend with. Unfortunately, many of these traditional techniques are useful and cannot be discarded. One such technique is the Kolmogorov-Smirnov (KS) test for goodness-of-fit (GoF). A Big Data and stream-appropriate KS-type test is derived via the chunked-and-averaged (CA) estimator paradigm. The new test is termed the CAKS GoF test. The CAKS test statistic is proved to be asymptotically normal, allowing for the large sample testing of GoF. Furthermore, theoretical results demonstrate that the CAKS test is consistent against both fixed alternatives, where the null and the true data…
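The chunked-and-averaged idea named in the abstract can be illustrated with a minimal sketch: split the stream into fixed-size chunks, compute an ordinary KS distance on each chunk, keep only a running sum, and compare the average with its null mean using a normal approximation. This is not the paper's exact construction. The function names, the chunk size `m`, the Monte Carlo calibration of the null mean and variance, and the one-sided upper-tail p-value are all assumptions made for illustration; the excerpt does not show how the paper centers and scales its statistic.

```python
import math
import random


def normal_cdf(x):
    """Standard normal CDF, used here as an example null hypothesis."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ks_statistic(chunk, cdf):
    """Two-sided KS distance between the chunk's empirical CDF and `cdf`."""
    u = sorted(cdf(x) for x in chunk)
    m = len(u)
    return max(max((i + 1) / m - v, v - i / m) for i, v in enumerate(u))


def null_moments(m, reps=2000, seed=0):
    """Monte Carlo mean and variance of the chunk KS distance under the null.

    Assumption: the null is continuous, so the KS distance is
    distribution-free and uniform draws are enough for calibration.
    """
    rng = random.Random(seed)
    vals = [
        ks_statistic([rng.random() for _ in range(m)], lambda x: x)
        for _ in range(reps)
    ]
    mu = sum(vals) / reps
    var = sum((v - mu) ** 2 for v in vals) / (reps - 1)
    return mu, var


def caks_style_test(stream, cdf, m, mu, var):
    """Consume `stream` chunk by chunk, storing one chunk and a running sum.

    Returns (z, p): a z-score for the averaged chunk statistic and an
    upper-tail normal p-value (large KS distances indicate misfit).
    A trailing partial chunk is dropped in this sketch.
    """
    total, n_chunks, chunk = 0.0, 0, []
    for x in stream:
        chunk.append(x)
        if len(chunk) == m:
            total += ks_statistic(chunk, cdf)
            n_chunks += 1
            chunk = []
    if n_chunks == 0:
        raise ValueError("stream shorter than one chunk")
    z = (total / n_chunks - mu) / math.sqrt(var / n_chunks)
    p = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p


if __name__ == "__main__":
    m = 100
    mu, var = null_moments(m)
    rng = random.Random(42)
    # Data drawn from N(0.2, 1) but tested against N(0, 1).
    shifted = (rng.gauss(0.2, 1.0) for _ in range(m * 500))
    print(caks_style_test(shifted, normal_cdf, m, mu, var))
```

Because only the current chunk and a running sum are held in memory, the cost per observation does not grow with the total sample size, which is the property that makes this style of test usable on a stream.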
Taxonomy
Topics: Data Stream Mining Techniques · Advanced Database Systems and Queries · Algorithms and Data Compression
