Refining Decision Boundaries In Anomaly Detection Using Similarity Search Within the Feature Space
Sidahmed Benabderrahmane, Petko Valtchev, James Cheney, Talal Rahwan

TL;DR
This paper presents SDA2E, a novel autoencoder with similarity-guided active learning strategies and a new similarity measure, significantly improving anomaly detection in highly imbalanced datasets such as those arising in cybersecurity threat detection.
Contribution
Introduces SDA2E, a sparse autoencoder with a similarity-guided active learning framework and a new similarity measure, enhancing decision boundary refinement in anomaly detection.
Findings
Achieves superior ranking performance with nDCG up to 1.0.
Reduces labeled data requirement by up to 80%.
Outperforms 15 state-of-the-art methods across 52 datasets.
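For readers unfamiliar with the ranking metric cited above, the following is a minimal sketch of nDCG (normalized discounted cumulative gain) applied to an anomaly ranking. It uses the standard textbook definition with a log2 rank discount; the function names `dcg` and `ndcg` and the toy scores are illustrative assumptions, not taken from the paper, and the paper's exact evaluation protocol may differ.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of a list already in ranked order."""
    # Rank 0 is the top position; the discount is log2(rank + 2).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(scores, labels):
    """nDCG of the ranking induced by anomaly scores (higher = more anomalous).

    labels: 1 for a true anomaly, 0 for a normal point (assumed binary relevance).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [labels[i] for i in order]
    ideal_dcg = dcg(sorted(labels, reverse=True))
    # With no anomalies at all the metric is undefined; return 0.0 by convention here.
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


# All true anomalies ranked first gives a perfect score.
perfect = ndcg([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0])
# The only anomaly ranked second out of two gives a lower score.
imperfect = ndcg([0.1, 0.9], [1, 0])
```

An nDCG of 1.0 therefore means every true anomaly was ranked above every normal point.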
Abstract
Detecting rare and diverse anomalies in highly imbalanced datasets, such as Advanced Persistent Threats (APTs) in cybersecurity, remains a fundamental challenge for machine learning systems. Active learning offers a promising direction by strategically querying an oracle to minimize labeling effort, yet conventional approaches often fail to exploit the intrinsic geometric structure of the feature space for model refinement. In this paper, we introduce SDA2E, a Sparse Dual Adversarial Attention-based AutoEncoder designed to learn compact and discriminative latent representations from imbalanced, high-dimensional data. We further propose a similarity-guided active learning framework that integrates three novel strategies to refine decision boundaries efficiently: normal-like expansion, which enriches the training set with points similar to labeled normals to improve reconstruction fidelity;…
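The normal-like expansion idea described in the abstract can be illustrated with a rough sketch, under stated assumptions. Cosine similarity stands in for the paper's own similarity measure, which is not specified in this summary, and the function names and the `threshold` parameter are hypothetical. The sketch only selects unlabeled points that closely resemble a labeled normal; it does not train or update any autoencoder.

```python
import math


def cosine(u, v):
    """Cosine similarity (a stand-in; the paper proposes its own measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def normal_like_expansion(labeled_normals, unlabeled, threshold=0.95):
    """Return indices of unlabeled points similar to at least one labeled normal.

    threshold is a hypothetical cutoff chosen for illustration only.
    """
    selected = []
    for idx, x in enumerate(unlabeled):
        # Similarity to the closest labeled normal; 0.0 if there are none.
        best = max((cosine(x, n) for n in labeled_normals), default=0.0)
        if best >= threshold:
            selected.append(idx)
    return selected


normals = [[1.0, 0.0], [0.9, 0.1]]
pool = [[1.0, 0.05], [0.0, 1.0], [0.8, 0.1]]
picked = normal_like_expansion(normals, pool)
```

In this toy pool, the first and third points lie close to the labeled normals and would be added to the training set, while the second would be left for other strategies or for the oracle.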
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Imbalanced Data Classification Techniques · Software System Performance and Reliability
