APT-ClaritySet: A Large-Scale, High-Fidelity Labeled Dataset for APT Malware with Alias Normalization and Graph-Based Deduplication
Zhenhao Yin, Hanbing Yan, Huishu Lu, Jing Xiong, Xiangyu Li, Rui Mei, Tianning Zang

TL;DR
This paper introduces APT-ClaritySet, a comprehensive, high-fidelity dataset for APT malware that normalizes actor aliases and reduces redundancy through graph-based deduplication, facilitating reproducible research.
Contribution
It presents a large-scale, standardized APT malware dataset with alias normalization and scalable deduplication, enabling more accurate analysis of APT patterns and evolution.
Findings
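Reconciled approximately 11.22% of inconsistent threat actor names through alias normalization.
Reduced the subset of statically analyzable executables by 47.55% through graph-feature deduplication, while retaining behaviorally distinct variants.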
Provided a function-reuse clustering resource for analyzing malware evolution.
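To make the alias-normalization finding concrete, here is a minimal Python sketch of mapping vendor-specific actor names onto one canonical label. It is not the paper's pipeline. The alias table, the helper names (`normalize_actor`, `reconciled_fraction`), and the way the reconciled share is measured are all assumptions made for illustration; the summary does not say how the 11.22% figure is computed.

```python
import re

# Hypothetical alias table. The entries are widely reported public aliases,
# used only for illustration; this is NOT the dataset's actual mapping.
ALIAS_TO_CANONICAL = {
    "apt28": "APT28",
    "fancybear": "APT28",
    "sofacy": "APT28",
    "sednit": "APT28",
    "lazarus": "Lazarus Group",
    "lazarusgroup": "Lazarus Group",
    "hiddencobra": "Lazarus Group",
}


def _key(name: str) -> str:
    """Lowercase the name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def normalize_actor(name: str) -> str:
    """Return the canonical actor name, or the trimmed input if unknown."""
    return ALIAS_TO_CANONICAL.get(_key(name), name.strip())


def reconciled_fraction(labels: list[str]) -> float:
    """Fraction of labels whose text changes under normalization.

    One possible way to measure how many names were reconciled
    (an assumption, not the paper's stated definition).
    """
    if not labels:
        return 0.0
    changed = sum(1 for label in labels if normalize_actor(label) != label)
    return changed / len(labels)


if __name__ == "__main__":
    raw = ["APT28", "Fancy Bear", "Sofacy", "Turla"]
    print([normalize_actor(label) for label in raw])
    print(reconciled_fraction(raw))
```

Matching on a punctuation-free lowercase key lets spellings such as "APT-28" and "apt 28" resolve to the same entry. Unknown names pass through unchanged, so they can be reviewed manually instead of being silently merged.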
Abstract
Large-scale, standardized datasets for Advanced Persistent Threat (APT) research are scarce, and inconsistent actor aliases and redundant samples hinder reproducibility. This paper presents APT-ClaritySet and its construction pipeline, which normalizes threat actor aliases (reconciling approximately 11.22% of inconsistent names) and applies graph-feature deduplication, reducing the subset of statically analyzable executables by 47.55% while retaining behaviorally distinct variants. APT-ClaritySet comprises: (i) APT-ClaritySet-Full, the complete pre-deduplication collection with 34,363 malware samples attributed to 305 APT groups (2006 to early 2025); (ii) APT-ClaritySet-Unique, the deduplicated release with 25,923 unique samples spanning 303 groups and standardized attributions; and (iii) APT-ClaritySet-FuncReuse, a function-level resource that includes 324,538 function-reuse…
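The abstract does not specify which graph features or similarity criterion the deduplication uses, so the following Python sketch shows only the general idea under stated assumptions. Each sample is represented by a hypothetical set of feature tokens, an edge links two samples whose Jaccard similarity meets a threshold, and one representative is kept per connected component. The function names, the token representation, and the 0.9 threshold are illustrative, not taken from the paper.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(features: dict[str, set], threshold: float = 0.9) -> list[str]:
    """Keep one representative per group of near-identical samples.

    `features` maps a sample id to a set of feature tokens (hypothetical
    stand-ins for graph-derived features). Samples whose similarity is at
    least `threshold` are linked, and linked groups are merged with
    union-find.
    """
    ids = sorted(features)
    parent = {i: i for i in ids}

    def find(x: str) -> str:
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller id as root so the result is deterministic.
            parent[max(ra, rb)] = min(ra, rb)

    # Pairwise comparison is quadratic; fine for a sketch, not for scale.
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if jaccard(features[a], features[b]) >= threshold:
                union(a, b)

    return sorted({find(i) for i in ids})


if __name__ == "__main__":
    samples = {
        "s1": {"f1", "f2", "f3"},
        "s2": {"f1", "f2", "f3"},  # same features as s1
        "s3": {"f1", "f9"},        # shares one feature, stays distinct
        "s4": {"f7", "f8"},
    }
    kept = deduplicate(samples)
    print(kept, 1 - len(kept) / len(samples))
```

A high threshold merges only near-identical samples, which is one way to remove redundancy while keeping distinct variants. A corpus of tens of thousands of samples would need a scalable candidate-selection step in place of the all-pairs loop used here.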
Taxonomy
Topics: Advanced Malware Detection Techniques · Information and Cyber Security · Security and Verification in Computing
