MalwarePT: A Binary-Level Foundation Model for Malware Analysis

Saastha Vasan; Yuzhou Nie; Kaie Chen; Yigitcan Kaya; Hojjat Aghakhani; Roman Vasilenko; Wenbo Guo; Christopher Kruegel; Giovanni Vigna

arXiv:2605.16455 · cs.CR · May 19, 2026

TL;DR

MalwarePT is a binary-level foundation model pretrained on Windows PE code-section bytes, demonstrating improved performance across malware analysis tasks at different granularities.

Contribution

Introduces MalwarePT, a binary-level foundation model with byte-pair encoding tokenization, capable of transfer learning across multiple malware analysis tasks.
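To make the tokenization idea concrete, the following is a minimal sketch of byte-pair encoding learned directly over raw bytes. It is a toy illustration, not the authors' tokenizer: the function names (`train_byte_bpe`, `merge_pair`, `encode`) and the sample byte strings are hypothetical, and a real pipeline would first extract code-section bytes from PE files and would use an optimized trainer.

```python
from collections import Counter


def merge_pair(seq, left, right, new_id):
    """Replace every non-overlapping (left, right) pair in seq with new_id."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_byte_bpe(corpus, vocab_size):
    """Learn BPE merges over raw bytes (hypothetical helper).

    Token ids 0..255 are the single byte values; each learned merge
    adds one new token id, starting at 256, until vocab_size is reached
    or no adjacent pair occurs at least twice.
    """
    seqs = [list(chunk) for chunk in corpus]
    merges = []  # list of ((left_id, right_id), new_id), in learned order
    next_id = 256
    while next_id < vocab_size:
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))  # count adjacent token pairs
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break
        merges.append(((left, right), next_id))
        seqs = [merge_pair(seq, left, right, next_id) for seq in seqs]
        next_id += 1
    return merges


def encode(data, merges):
    """Tokenize bytes by replaying the learned merges in order."""
    seq = list(data)
    for (left, right), new_id in merges:
        seq = merge_pair(seq, left, right, new_id)
    return seq


# Made-up sample bytes resembling x86 function prologues and epilogues.
corpus = [bytes.fromhex("558bec83ec10") * 4, bytes.fromhex("558bec5dc3") * 4]
merges = train_byte_bpe(corpus, vocab_size=258)
tokens = encode(bytes.fromhex("558bec"), merges)
```

In this toy corpus the three-byte sequence `55 8b ec` collapses into a single token after two merges, which is the kind of multi-byte code pattern a byte-level tokenizer would spread across three positions.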

Findings

01. Pretraining significantly improves API call prediction and functionality classification.

02. Increasing the BPE vocabulary size improves performance, with the best results at 1,024 tokens.

03. MalwarePT outperforms neural network baselines in malware detection at low false positive rates (FPR).
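Detection "at low FPR" is usually reported as the true positive rate achieved when the decision threshold is set so that only a small fraction of benign samples are flagged. The sketch below shows one common way to compute that number. The function name `tpr_at_fpr`, the scores, and the 10% target are illustrative assumptions, not values or code from the paper.

```python
def tpr_at_fpr(scores, labels, target_fpr):
    """Return (threshold, tpr) at a false positive budget (hypothetical helper).

    A sample is predicted malicious when its score is strictly greater
    than the threshold. The threshold is chosen from the benign scores so
    that at most target_fpr of the benign samples are flagged.
    Labels: 1 = malicious, 0 = benign.
    """
    benign = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    malicious = [s for s, y in zip(scores, labels) if y == 1]
    if not benign or not malicious:
        raise ValueError("need at least one benign and one malicious sample")

    # Number of benign false positives the budget allows.
    allowed = int(target_fpr * len(benign))
    # Using the (allowed + 1)-th highest benign score as a strict threshold
    # guarantees that no more than `allowed` benign samples exceed it.
    threshold = benign[allowed] if allowed < len(benign) else float("-inf")
    tpr = sum(s > threshold for s in malicious) / len(malicious)
    return threshold, tpr


# Made-up classifier scores: ten benign samples, five malicious samples.
benign_scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.60, 0.90]
malicious_scores = [0.95, 0.70, 0.65, 0.50, 0.30]
scores = benign_scores + malicious_scores
labels = [0] * len(benign_scores) + [1] * len(malicious_scores)

threshold, tpr = tpr_at_fpr(scores, labels, target_fpr=0.10)
```

With these made-up scores, a 10% budget permits one benign false positive, so the threshold lands on the second-highest benign score and three of the five malicious samples are detected.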

Abstract

Automated malware analysis increasingly relies on machine learning, yet most existing methods remain task-specific and depend on handcrafted features or narrowly scoped models. Recent developments in binary-level foundation models suggest a path toward reusable program representations, but their application to malware analysis remains underexplored, and most still operate at byte-level tokenization, limiting their ability to capture multi-byte code patterns. In this work, we introduce MalwarePT, a binary-level foundation model for malware analysis built on a ModernBERT-style encoder and pretrained with masked language modeling on Windows PE code-section bytes. We study whether a single pretrained encoder can transfer across malware-analysis tasks at different granularities, and how tokenization design affects that transfer. We train a byte-pair encoding (BPE) tokenizer on code-section…
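As a rough illustration of the masked language modeling objective mentioned in the abstract, the sketch below corrupts a token sequence and records which positions the model would be asked to reconstruct. The masking rate, the 80/10/10 replacement split (taken from the original BERT recipe), the `mask_id` value, and the function name `mask_tokens` are all assumptions for illustration; the source does not state MalwarePT's actual masking configuration.

```python
import random

IGNORE = -100  # label for positions excluded from the loss (PyTorch's default ignore_index)


def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15, seed=None):
    """BERT-style masking sketch (hypothetical helper, assumed hyperparameters).

    Each position is selected with probability mask_prob. A selected
    position keeps its original id as the label, and its input is replaced
    by mask_id (80%), a random token id (10%), or left unchanged (10%).
    """
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [IGNORE] * len(inputs)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue  # position not selected; no loss is computed here
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)
        # otherwise the input token is deliberately left unchanged
    return inputs, labels


# Toy token ids; id 1024 is an assumed [MASK] id placed just past the vocabulary.
tokens = [257, 0x83, 0xEC, 0x10, 0x5D, 0xC3]
inputs, labels = mask_tokens(tokens, mask_id=1024, vocab_size=1024, seed=0)
```

The encoder is then trained to predict the original token at every position whose label is not `IGNORE`, using the corrupted `inputs` as context.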
