Comparison of Outlier Detection Algorithms on String Data
Philip Maus

TL;DR
This paper compares two novel algorithms for detecting outliers in string data, one based on a modified local outlier factor using Levenshtein distance and another on hierarchical regular expressions, demonstrating their effectiveness in different scenarios.
Contribution
Introduces a variant of local outlier factor tailored for string data and a new hierarchical regular expression-based outlier detection algorithm, with experimental comparison.
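To make the first contribution concrete, the following is a minimal sketch of a local outlier factor computed over strings with plain, unit-cost Levenshtein distance. It is not the thesis's implementation: the function names `levenshtein` and `lof_scores`, the default `k`, and the simplification of using exactly k neighbours (ignoring ties at the k-distance) are all assumptions made for illustration.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def lof_scores(strings: Sequence[str], k: int = 3) -> List[float]:
    """LOF score per string; values well above 1 suggest an outlier.

    Assumption: each neighbourhood holds exactly k strings (ties at the
    k-distance are cut arbitrarily), which simplifies the textbook definition.
    Requires len(strings) > k.
    """
    n = len(strings)
    dist = [[levenshtein(s, t) for t in strings] for s in strings]

    neighbours, k_dist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(order[:k])
        k_dist.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance.
    eps = 1e-9  # guards against division by zero when strings are duplicated
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbours[i]]
        lrd.append(len(reach) / (sum(reach) + eps))

    # LOF: mean density of the neighbours relative to the point's own density.
    return [
        sum(lrd[j] for j in neighbours[i]) / (len(neighbours[i]) * lrd[i])
        for i in range(n)
    ]
```

On a small list of similarly formatted dates plus one unrelated string, the unrelated string receives the highest score, since its neighbours sit in a much denser region than it does. The pairwise distance matrix makes this sketch quadratic in the number of strings, so it only suits small datasets.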
Findings
The regular expression-based method excels when outliers differ structurally from the expected data.
The Levenshtein-based LOF performs well when the edit distances of outliers differ from those among the expected data.
Both algorithms can, in principle, identify outliers in string data.
Abstract
Outlier detection is a well-researched and crucial problem in machine learning. However, there is little research on string data outlier detection, as most literature focuses on outlier detection of numerical data. A robust string data outlier detection algorithm could assist with data cleaning or anomaly detection in system log files. In this thesis, we compare two string outlier detection algorithms. Firstly, we introduce a variant of the well-known local outlier factor algorithm, which we tailor to detect outliers on string data using the Levenshtein measure to calculate the density of the dataset. We present a differently weighted Levenshtein measure, which considers hierarchical character classes and can be used to tune the algorithm to a specific string dataset. Secondly, we introduce a new kind of outlier detection algorithm based on the hierarchical left regular expression…
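The abstract mentions a weighted Levenshtein measure that takes hierarchical character classes into account. One way such a measure could look is sketched below. The class hierarchy (digit, lowercase letter, uppercase letter, symbol), the specific costs 0.25, 0.5, and 1.0, and the names `char_path`, `substitution_cost`, and `weighted_levenshtein` are illustrative assumptions, not the weighting defined in the thesis.

```python
from typing import Tuple


def char_path(c: str) -> Tuple[str, ...]:
    """Path of a character in an assumed two-level class hierarchy."""
    if c.isdigit():
        return ("digit",)
    if c.islower():
        return ("letter", "lower")
    if c.isupper():
        return ("letter", "upper")
    return ("symbol",)  # assumption: all remaining characters form one class


def substitution_cost(a: str, b: str) -> float:
    """Cheaper substitutions for characters that are close in the hierarchy."""
    if a == b:
        return 0.0
    pa, pb = char_path(a), char_path(b)
    if pa == pb:
        return 0.25   # same leaf class, e.g. '3' -> '7' (assumed weight)
    if pa[0] == pb[0]:
        return 0.5    # same parent class, e.g. 'a' -> 'A' (assumed weight)
    return 1.0        # unrelated classes, e.g. 'a' -> '7'


def weighted_levenshtein(a: str, b: str) -> float:
    """Edit distance with class-aware substitution; insert/delete cost 1.0."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [float(i)]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1.0,                            # delete ca
                curr[j - 1] + 1.0,                        # insert cb
                prev[j - 1] + substitution_cost(ca, cb),  # substitute or match
            ))
        prev = curr
    return prev[-1]
```

Under these assumed weights, strings that follow the same pattern but differ in concrete digits or letters stay close together, while a string that swaps a digit for a letter moves further away. Tuning the costs to a specific dataset would amount to changing the three constants or the hierarchy in `char_path`.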
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Machine Learning and Data Classification · Network Security and Intrusion Detection
