SDCOR: Scalable Density-based Clustering for Local Outlier Detection in Massive-Scale Datasets

Sayyed Ahmad Naghavi Nozad; Maryam Amir Haeri; Gianluigi Folino

arXiv:2006.07616·cs.LG·July 6, 2021

TL;DR

This paper introduces SDCOR, a scalable density-based clustering method for local outlier detection in massive datasets, which processes data in chunks and efficiently identifies outliers with low memory usage.

Contribution

The paper proposes a novel batch-wise clustering approach that scales to large datasets and accurately detects outliers without requiring all data in memory.

Findings

01

Low linear time complexity demonstrated on real and synthetic data

02

More effective than traditional density-based methods

03

Outperforms some fast distance-based methods in efficiency

Abstract

This paper presents a batch-wise density-based clustering approach for local outlier detection in massive-scale datasets. Unlike the well-known traditional algorithms, which assume that all the data is memory-resident, our proposed method is scalable and processes the input data chunk by chunk within the confines of a limited memory buffer. A temporary clustering model is built in the first phase; it is then gradually updated by analyzing consecutive memory loads of points. At the end of scalable clustering, the approximate structure of the original clusters is obtained. Finally, in another scan of the entire dataset and using a suitable criterion, each object is assigned an outlying score called SDCOR (Scalable Density-based Clustering Outlierness Ratio). Evaluations on real-life and synthetic datasets demonstrate that the proposed method has a low linear time…
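
The two-scan, chunk-by-chunk flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: a simple leader-style assignment stands in for the density-based clustering step, a per-dimension scaled distance stands in for the paper's scoring criterion, and the names `build_model`, `score`, `radius`, `min_size`, and `chunk_size` are hypothetical choices made only for this example.

```python
import math
from itertools import islice

def chunks(points, size):
    """Yield successive lists of at most `size` points (one memory load each)."""
    it = iter(points)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

class Summary:
    """Running count, mean, and per-dimension spread of one cluster."""
    def __init__(self, point):
        self.n = 1
        self.mean = list(point)
        self.m2 = [0.0] * len(point)  # sums of squared deviations

    def add(self, point):
        # Welford's online update, applied independently per dimension.
        self.n += 1
        for i, x in enumerate(point):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (x - self.mean[i])

    def std(self, i, floor=1e-9):
        # Floor avoids division by zero for a constant dimension.
        return max(math.sqrt(self.m2[i] / self.n), floor)

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def scaled_dist(p, s):
    """Distance from p to summary s, scaled by the cluster's spread."""
    return math.sqrt(sum(((x - s.mean[i]) / s.std(i)) ** 2
                         for i, x in enumerate(p)))

def build_model(points, chunk_size, radius):
    """Scan 1: update the cluster summaries one chunk at a time."""
    model = []
    for chunk in chunks(points, chunk_size):
        for p in chunk:
            nearest = min(model, key=lambda s: euclid(p, s.mean), default=None)
            if nearest is not None and euclid(p, nearest.mean) <= radius:
                nearest.add(p)           # absorb into an existing cluster
            else:
                model.append(Summary(p))  # start a new tentative cluster
    return model

def score(points, model, chunk_size, min_size=3):
    """Scan 2: score each point against the nearest non-trivial cluster.

    Assumes at least one summary holds `min_size` or more points.
    """
    big = [s for s in model if s.n >= min_size]
    scores = []
    for chunk in chunks(points, chunk_size):
        scores.extend(min(scaled_dist(p, s) for s in big) for p in chunk)
    return scores

# Toy data (an assumption for illustration): two tight grids and one stray point.
offsets = [-0.5, 0.0, 0.5]
cluster_a = [(dx, dy) for dx in offsets for dy in offsets]
cluster_b = [(10 + dx, 10 + dy) for dx in offsets for dy in offsets]
points = cluster_a + cluster_b + [(5.0, 5.0)]

model = build_model(points, chunk_size=4, radius=2.0)
scores = score(points, model, chunk_size=4)
outlier_index = scores.index(max(scores))
```

Only the cluster summaries and the current chunk are held at any time during the first scan, which is the property the abstract emphasizes; the second scan reuses the finished summaries to score every point, so the stray point at `(5.0, 5.0)` receives the highest score in this toy run.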

Code & Models

Repositories

sana33/SDCOR
none·Official
