Data Discovery and Anomaly Detection Using Atypicality: Theory

Anders Høst-Madsen; Elyas Sabeti; Chad Walton

arXiv:1709.03189·cs.IT·September 12, 2017·2 cites

TL;DR

This paper introduces a theoretical framework for identifying atypical data, defined as data that can be encoded with fewer bits in itself than with the code for the typical data, and applies it to a number of real-world data sets.

Contribution

The paper gives an axiomatic definition of atypicality, shows that the definition has good theoretical properties, and develops an implementation based on universal source coding for anomaly detection in big data.

Findings

01 Effective detection of atypical data points in real datasets

02 Theoretical validation of the atypicality concept

03 Implementation using universal source coding

Abstract

A central question in the era of 'big data' is what to do with the enormous amount of information. One possibility is to characterize it through statistics, e.g., averages, or classify it using machine learning, in order to understand the general structure of the overall data. The perspective in this paper is the opposite, namely that most of the value in the information in some applications is in the parts that deviate from the average, that are unusual, atypical. We define what we mean by 'atypical' in an axiomatic way as data that can be encoded with fewer bits in itself rather than using the code for the typical data. We show that this definition has good theoretical properties. We then develop an implementation based on universal source coding, and apply this to a number of real world data sets.
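
To make the definition concrete, the following is a minimal sketch of the codelength comparison for binary sequences. It is not the paper's implementation: the Bernoulli typical model, the two-part code (empirical entropy plus (1/2) log2 n bits for the parameter), the one-bit `overhead` for signalling that the data is coded "in itself", and all function names are assumptions chosen for illustration.

```python
import math


def typical_codelength(bits, p_one=0.5):
    """Bits needed to encode `bits` with the code for the typical data.

    Assumption: typical data is i.i.d. Bernoulli(p_one) with 0 < p_one < 1.
    """
    ones = sum(bits)
    zeros = len(bits) - ones
    return -(ones * math.log2(p_one) + zeros * math.log2(1.0 - p_one))


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) source."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def self_codelength(bits, overhead=1.0):
    """Bits needed to encode `bits` "in itself" (hypothetical two-part code).

    Cost = n * H(p_hat) for the data, (1/2) * log2(n) for the estimated
    parameter, plus `overhead` bits to flag that this code is in use.
    """
    n = len(bits)
    if n == 0:
        raise ValueError("sequence must be non-empty")
    p_hat = sum(bits) / n
    return n * binary_entropy(p_hat) + 0.5 * math.log2(n) + overhead


def is_atypical(bits, p_one=0.5, overhead=1.0):
    """True when the sequence is cheaper to encode in itself than with
    the typical code."""
    return self_codelength(bits, overhead) < typical_codelength(bits, p_one)


balanced = [0, 1] * 32          # looks like fair coin flips
skewed = [1] * 60 + [0] * 4     # far from the typical model
print(is_atypical(balanced), is_atypical(skewed))
```

Under these assumptions the balanced sequence costs 64 bits with the typical code and slightly more in itself, so it is not flagged, while the skewed sequence compresses well in itself and is flagged. A universal source coder would replace the simple two-part code in a more realistic setting.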

Taxonomy

Topics: Anomaly Detection Techniques and Applications · Fractal and DNA sequence analysis · Algorithms and Data Compression