Data Discovery and Anomaly Detection Using Atypicality: Theory
Anders Høst-Madsen, Elyas Sabeti, Chad Walton

TL;DR
This paper introduces a theoretical framework for identifying atypical data, defined as data that can be encoded in fewer bits on its own than with the code designed for the typical data. It implements the framework with universal source coding and applies it to real-world datasets.
Contribution
It provides an axiomatic definition of atypicality and develops a universal coding-based method for anomaly detection in big data.
Findings
Effective detection of atypical data points in real datasets
Theoretical validation of the atypicality concept
Implementation using universal source coding
Abstract
A central question in the era of 'big data' is what to do with the enormous amount of information. One possibility is to characterize it through statistics, e.g., averages, or classify it using machine learning, in order to understand the general structure of the overall data. The perspective in this paper is the opposite, namely that most of the value in the information in some applications is in the parts that deviate from the average, that are unusual, atypical. We define what we mean by 'atypical' in an axiomatic way as data that can be encoded with fewer bits in itself rather than using the code for the typical data. We show that this definition has good theoretical properties. We then develop an implementation based on universal source coding, and apply this to a number of real world data sets.
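The definition in the abstract can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the paper: the typical model is a fair binary source (1 bit per symbol), the "code in itself" is the Krichevsky-Trofimov sequential universal code, and `tau` is a hypothetical fixed overhead (in bits) for signalling that the atypical code is in use. The function names are illustrative only.

```python
import math


def typical_length_bits(bits):
    """Code length under the assumed typical model: fair coin, 1 bit per symbol."""
    return float(len(bits))


def kt_length_bits(bits):
    """Code length of the Krichevsky-Trofimov universal code for a binary sequence.

    Each symbol is coded with probability (count_of_symbol + 1/2) / (t + 1),
    where t is the number of symbols seen so far.
    """
    counts = [0, 0]
    total = 0.0
    for t, b in enumerate(bits):
        p = (counts[b] + 0.5) / (t + 1.0)
        total -= math.log2(p)
        counts[b] += 1
    return total


def is_atypical(bits, tau=8.0):
    """Flag a sequence when coding it 'in itself' (plus the assumed overhead tau)
    takes fewer bits than coding it with the typical code."""
    return kt_length_bits(bits) + tau < typical_length_bits(bits)


# A heavily biased sequence compresses well on its own; a balanced one does not.
biased = [1] * 60 + [0] * 4
balanced = [0, 1] * 32

if __name__ == "__main__":
    print("biased atypical:", is_atypical(biased))
    print("balanced atypical:", is_atypical(balanced))
```

In this sketch the biased sequence is flagged because the universal code adapts to its skewed statistics, while the balanced sequence is not, since the universal code pays a learning cost that the fair-coin code avoids. The overhead `tau` stands in for whatever penalty keeps typical data from being flagged by chance; its value here is arbitrary.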
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Fractal and DNA sequence analysis · Algorithms and Data Compression
