Maximizing Information Gain in Privacy-Aware Active Learning of Email Anomalies
Mu-Huan Miles Chung, Sharon Li, Jaturong Kongmanee, Lu Wang, Yuhong Yang, Calvin Giang, Khilan Jerath, Abhay Raman, David Lie, Mark Chignell

TL;DR
This paper introduces an information gain maximizing heuristic for active learning in privacy-sensitive email anomaly detection, demonstrating improved performance with expert-labeled data and confidence-based sampling strategies.
Contribution
It develops a novel active learning method that maximizes information gain using analyst confidence, tailored for privacy-constrained email anomaly detection.
Findings
Information gain heuristic outperforms existing sampling methods.
Expert analyst labels improve model performance.
Calibrated confidence estimates are crucial for effective sampling.
Abstract
Redacted emails satisfy most privacy requirements, but they make it more difficult to detect anomalous emails that may be indicative of data exfiltration. In this paper we develop an enhanced method of Active Learning using an information-gain-maximizing heuristic, and we evaluate its effectiveness in a real-world setting where only redacted versions of emails could be labeled by human analysts due to privacy concerns. In the first case study we examined how Active Learning should be carried out. We found that model performance was best when a single analyst who was highly skilled at the labeling task provided the labels. In the second case study we used confidence ratings to estimate the labeling uncertainty of analysts and then prioritized instances for labeling based on the expected information gain (the difference between model uncertainty and analyst uncertainty) that would be…
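The prioritization idea in the abstract (expected information gain as model uncertainty minus analyst uncertainty) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes, purely for illustration, that model uncertainty is the binary entropy of the predicted anomaly probability, that analyst uncertainty is one minus an estimated confidence rating scaled to [0, 1], and that such an estimate is available for each candidate before labeling. The function names and the candidate tuple layout are hypothetical.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable; defined as 0 at p=0 or p=1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def expected_information_gain(model_prob, analyst_confidence):
    """Model uncertainty minus analyst uncertainty (both assumed to lie in [0, 1]).

    Assumption: analyst uncertainty is 1 - confidence. The paper may define it differently.
    """
    model_uncertainty = binary_entropy(model_prob)
    analyst_uncertainty = 1.0 - analyst_confidence
    return model_uncertainty - analyst_uncertainty

def rank_for_labeling(candidates, budget):
    """Return the ids of the `budget` candidates with the highest expected gain.

    `candidates` is a list of (email_id, model_prob, analyst_confidence) tuples.
    """
    scored = [(expected_information_gain(p, c), email_id)
              for email_id, p, c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [email_id for _, email_id in scored[:budget]]

if __name__ == "__main__":
    pool = [
        ("a", 0.50, 0.9),  # model unsure, analyst confident: high gain
        ("b", 0.50, 0.2),  # model unsure, analyst also unsure: low gain
        ("c", 0.99, 0.9),  # model already sure: little to learn
    ]
    print(rank_for_labeling(pool, budget=2))
```

Under these assumptions, an email is worth sending to an analyst when the model is uncertain about it and the analyst is likely to label it confidently. Emails that the model already classifies confidently, or that the analyst cannot judge from the redacted text, rank lower.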
Taxonomy
Topics: Personal Information Management and User Behavior · Internet Traffic Analysis and Secure E-voting · Network Security and Intrusion Detection
