Uncovering Semantics and Topics Utilized by Threat Actors to Deliver Malicious Attachments and URLs
Andrey Yakymovych, Abhishek Singh

TL;DR
This paper applies unsupervised topic modeling (BERTopic) and semantic analysis to uncover the common themes and semantics of malicious emails, with the aim of improving threat detection.
Contribution
It introduces a novel application of multilingual embedding models and clustering algorithms for semantic analysis of malicious email content, revealing threat actor patterns.
Findings
Identifies common semantics in malicious emails
Compares clustering algorithms for effectiveness
Provides insights into threat actor themes
Abstract
Recent threat reports highlight that email remains the top vector for delivering malware to endpoints. Despite this, detection of malicious email attachments and URLs often neglects semantic cues, linguistic features, and contextual clues. Our study employs BERTopic unsupervised topic modeling to identify the common semantics and themes embedded in emails that deliver malicious attachments and call-to-action URLs. We preprocess emails by extracting and sanitizing content and employ multilingual embedding models such as BGE-M3 to produce dense representations, which clustering algorithms (HDBSCAN and OPTICS) use to group emails by semantic similarity. Phi3-Mini-4K-Instruct facilitates semantic analysis and hLDA aids thematic analysis to understand threat actor patterns. Our research will evaluate and compare different clustering algorithms on topic quantity, coherence, and diversity metrics, concluding…
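The comparison on topic quantity and diversity mentioned in the abstract can be illustrated with a minimal sketch. It assumes that cluster labels follow the scikit-learn convention of marking noise points as -1, and that diversity is defined as the fraction of unique words among the top-k words of all topics, which is one common definition and may differ from the one the authors use. The function names and the toy data are hypothetical and are not results from the paper.

```python
from typing import Dict, List


def topic_count(labels: List[int]) -> int:
    """Number of clusters found, ignoring the noise label -1 (assumed convention)."""
    return len({label for label in labels if label != -1})


def topic_diversity(topics: Dict[int, List[str]], top_k: int = 5) -> float:
    """Fraction of unique words among the top_k words of every topic.

    1.0 means no word is shared between topics; lower values mean more overlap.
    """
    words = [w for topic_words in topics.values() for w in topic_words[:top_k]]
    return len(set(words)) / len(words) if words else 0.0


# Made-up output of two hypothetical clustering runs over the same eight emails.
labels_a = [0, 0, 1, 1, 2, -1, 2, 0]
topics_a = {
    0: ["invoice", "payment", "attached", "overdue", "remit"],
    1: ["password", "verify", "account", "login", "expire"],
    2: ["shipment", "tracking", "delivery", "parcel", "customs"],
}

labels_b = [0, 0, 1, 1, 1, -1, -1, 0]
topics_b = {
    0: ["invoice", "payment", "attached", "account", "verify"],
    1: ["password", "verify", "account", "login", "payment"],
}

for name, labels, topics in [("run A", labels_a, topics_a),
                             ("run B", labels_b, topics_b)]:
    print(name, topic_count(labels), round(topic_diversity(topics), 2))
```

In this toy data, run A yields three topics with no shared top words, while run B yields two topics that overlap on several words and therefore scores lower on diversity. Coherence is omitted because it needs a reference corpus to compute word co-occurrence.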
Taxonomy
Topics: Spam and Phishing Detection · Cybercrime and Law Enforcement Studies · Information and Cyber Security
