Sparx: Distributed Outlier Detection at Scale
Sean Zhang, Varun Ursekar, Leman Akoglu

TL;DR
Sparx is a scalable, distributed outlier detection (OD) algorithm implemented in Apache Spark. It handles datasets with billions of points and millions of features, filling a gap in publicly available OD implementations for large-scale data.
Contribution
This paper introduces Sparx, a novel data-parallel outlier detection algorithm designed for shared-nothing architectures, with an open-source implementation in Apache Spark and extensive validation on three large real-world datasets.
Findings
Sparx scales effectively to billions of data points
Existing open-source solutions fail to scale with either the number of points or the dimensionality
Sparx outperforms other open-source OD methods in scalability and effectiveness
Abstract
There is no shortage of outlier detection (OD) algorithms in the literature, yet a vast body of them are designed for a single machine. With the increasing reality of already cloud-resident datasets comes the need for distributed OD techniques. This area, however, is not only understudied but also short of public-domain implementations for practical use. This paper aims to fill this gap: We design Sparx, a data-parallel OD algorithm suitable for shared-nothing infrastructures, which we specifically implement in Apache Spark. Through extensive experiments on three real-world datasets, with several billions of points and millions of features, we show that existing open-source solutions fail to scale up, either in the number of points or in dimensionality, whereas Sparx yields scalable and effective performance. To facilitate practical use of OD on modern-scale datasets, we open-source…
