APT-ClaritySet: A Large-Scale, High-Fidelity Labeled Dataset for APT Malware with Alias Normalization and Graph-Based Deduplication

Zhenhao Yin; Hanbing Yan; Huishu Lu; Jing Xiong; Xiangyu Li; Rui Mei; Tianning Zang

arXiv:2512.15039 · cs.CR · December 18, 2025

TL;DR

This paper introduces APT-ClaritySet, a comprehensive, high-fidelity dataset for APT malware that normalizes actor aliases and reduces redundancy through graph-based deduplication, facilitating reproducible research.

Contribution

It presents a large-scale, standardized APT malware dataset with alias normalization and scalable deduplication, enabling more accurate analysis of APT patterns and evolution.

Findings

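01. Reconciled approximately 11.22% of inconsistent threat actor names through alias normalization.

02. Reduced the subset of statically analyzable executables by 47.55% through graph-feature deduplication, while retaining behaviorally distinct variants.

03. Provided a function-reuse clustering resource, APT-ClaritySet-FuncReuse, for analyzing malware evolution.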

Abstract

Large-scale, standardized datasets for Advanced Persistent Threat (APT) research are scarce, and inconsistent actor aliases and redundant samples hinder reproducibility. This paper presents APT-ClaritySet and its construction pipeline, which normalizes threat actor aliases (reconciling approximately 11.22% of inconsistent names) and applies graph-feature deduplication, reducing the subset of statically analyzable executables by 47.55% while retaining behaviorally distinct variants. APT-ClaritySet comprises: (i) APT-ClaritySet-Full, the complete pre-deduplication collection with 34,363 malware samples attributed to 305 APT groups (2006 to early 2025); (ii) APT-ClaritySet-Unique, the deduplicated release with 25,923 unique samples spanning 303 groups and standardized attributions; and (iii) APT-ClaritySet-FuncReuse, a function-level resource that includes 324,538 function-reuse…
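The two pipeline stages named in the abstract, alias normalization and graph-feature deduplication, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the helper names (`normalize_name`, `build_alias_map`, `deduplicate`), the record fields, the sample identifiers, and the feature tuples are all hypothetical. The sketch also assumes that aliases arrive as known-equivalent pairs and that two samples are duplicates only when their graph-feature tuples match exactly, which is a much cruder criterion than a real graph-feature comparison.

```python
def normalize_name(name):
    """Case-fold and collapse whitespace so spelling variants compare equal."""
    return " ".join(name.casefold().split())


def build_alias_map(alias_pairs):
    """Group equivalent aliases with union-find; return alias -> canonical name."""
    parent = {}

    def find(name):
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for left, right in alias_pairs:
        root_l = find(normalize_name(left))
        root_r = find(normalize_name(right))
        if root_l != root_r:
            # Assumption: the alphabetically first name becomes canonical.
            keep, drop = sorted((root_l, root_r))
            parent[drop] = keep
    return {name: find(name) for name in list(parent)}


def deduplicate(samples, alias_map):
    """Keep the first sample seen for each graph-feature tuple (exact match only)."""
    unique = {}
    for sample in samples:
        actor = normalize_name(sample["actor"])
        canonical = alias_map.get(actor, actor)
        key = sample["graph_features"]
        if key not in unique:
            unique[key] = {**sample, "actor": canonical}
    return list(unique.values())


# Illustrative inputs only; none of these records come from the dataset.
ALIAS_PAIRS = [("APT28", "Fancy Bear"), ("Sofacy", "Fancy Bear")]
SAMPLES = [
    {"id": "sample-1", "actor": "Fancy Bear", "graph_features": (120, 340, 17)},
    {"id": "sample-2", "actor": "Sofacy", "graph_features": (120, 340, 17)},
    {"id": "sample-3", "actor": "APT28", "graph_features": (98, 210, 9)},
]

if __name__ == "__main__":
    alias_map = build_alias_map(ALIAS_PAIRS)
    for record in deduplicate(SAMPLES, alias_map):
        print(record["id"], record["actor"], record["graph_features"])
```

Normalizing attributions before deduplicating means that samples filed under different names for the same group end up under one label, whichever copy of a duplicate is retained.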

Taxonomy

Topics: Advanced Malware Detection Techniques · Information and Cyber Security · Security and Verification in Computing