Sparx: Distributed Outlier Detection at Scale

Sean Zhang; Varun Ursekar; Leman Akoglu

arXiv:2206.01281 · cs.DC · June 6, 2022

TL;DR

Sparx is a distributed outlier detection (OD) algorithm implemented in Apache Spark. It scales to datasets with billions of points and millions of features, filling a gap in publicly available OD solutions for large-scale practical use.

Contribution

This paper introduces Sparx, a data-parallel outlier detection algorithm designed for shared-nothing architectures, with an open-source implementation in Apache Spark and experiments on three large real-world datasets.

Findings

01. Sparx scales effectively to billions of data points

02. Existing solutions struggle with high dimensionality or large datasets

03. Sparx outperforms other open-source OD methods in scalability and effectiveness

Abstract

There is no shortage of outlier detection (OD) algorithms in the literature, yet the vast majority of them are designed for a single machine. As more datasets already reside in the cloud, the need for distributed OD techniques grows. This area, however, is not only understudied but also short of public-domain implementations for practical use. This paper aims to fill this gap: We design Sparx, a data-parallel OD algorithm suitable for shared-nothing infrastructures, which we specifically implement in Apache Spark. Through extensive experiments on three real-world datasets, with several billions of points and millions of features, we show that existing open-source solutions fail to scale up, either to a large number of points or to high dimensionality, whereas Sparx yields scalable and effective performance. To facilitate practical use of OD on modern-scale datasets, we open-source…
