RawMal-TF: Raw Malware Dataset Labeled by Type and Family
David Bálik, Martin Jureček, Mark Stamp

TL;DR
This paper introduces RawMal-TF, a comprehensive malware dataset labeled by type and family, and evaluates machine learning models for malware detection and classification using static analysis features.
Contribution
The work presents a new large-scale malware dataset with dual-level labels and a unified static feature extraction pipeline for improved malware classification.
Findings
High accuracy in binary malware detection (up to 98.98%)
Strong performance in interclass classification (up to 97.5%)
Effective models even with limited data (97.6% detection with 1,000 samples)
Abstract
This work addresses the challenge of malware classification using machine learning by developing a novel dataset labeled at both the malware type and family levels. Raw binaries were collected from sources such as VirusShare, VX Underground, and MalwareBazaar, and subsequently labeled with family information parsed from binary names and type-level labels integrated from ClarAVy. The dataset includes 14 malware types and 17 malware families, and was processed using a unified feature extraction pipeline based on static analysis, particularly extracting features from Portable Executable headers, to support advanced classification tasks. The evaluation was focused on three key classification tasks. In the binary classification of malware versus benign samples, Random Forest and XGBoost achieved high accuracy on the full datasets, reaching 98.5% for type-based detection and 98.98% for…
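To make the idea of static features from Portable Executable headers concrete, here is a minimal sketch that reads the fixed-layout COFF file header and the optional-header magic value from raw bytes. It is not the authors' pipeline: the function name `extract_pe_header_features`, the chosen fields, and the feature key names are assumptions made for illustration, and a real pipeline would likely extract many more fields.

```python
import struct

# Layout of the 20-byte COFF file header that follows the "PE\0\0" signature.
COFF_FORMAT = "<HHIIIHH"
COFF_FIELDS = (
    "machine",
    "number_of_sections",
    "time_date_stamp",
    "pointer_to_symbol_table",
    "number_of_symbols",
    "size_of_optional_header",
    "characteristics",
)


def extract_pe_header_features(data: bytes) -> dict:
    """Hypothetical helper: map COFF header fields to a flat feature dict.

    Raises ValueError if the buffer does not look like a PE file.
    """
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("missing DOS 'MZ' signature")

    # e_lfanew (offset 0x3C in the DOS header) points at the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    coff_offset = pe_offset + 4
    if data[pe_offset:coff_offset] != b"PE\x00\x00":
        raise ValueError("missing 'PE' signature")

    coff_size = struct.calcsize(COFF_FORMAT)
    if len(data) < coff_offset + coff_size:
        raise ValueError("truncated COFF header")

    values = struct.unpack_from(COFF_FORMAT, data, coff_offset)
    features = dict(zip(COFF_FIELDS, values))

    # The optional header starts right after the COFF header; its first
    # two bytes are the magic (0x10B for PE32, 0x20B for PE32+).
    opt_offset = coff_offset + coff_size
    if features["size_of_optional_header"] >= 2 and len(data) >= opt_offset + 2:
        (magic,) = struct.unpack_from("<H", data, opt_offset)
        features["optional_magic"] = magic
    else:
        features["optional_magic"] = 0  # assumed placeholder when absent

    return features
```

A dict like this could be turned into one row of a feature matrix for a classifier such as Random Forest or XGBoost. Malformed samples raise `ValueError`, so a caller can skip or log them rather than feeding partial rows to the model.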
Taxonomy
Topics: Advanced Malware Detection Techniques · Software Engineering Research · Network Security and Intrusion Detection
