Analyzing PDFs like Binaries: Adversarially Robust PDF Malware Analysis via Intermediate Representation and Language Model

Side Liu; Jiang Ming; Guodong Zhou; Xinyi Liu; Jianming Fu; Guojun Peng

arXiv:2506.17162 · cs.CR · December 8, 2025


TL;DR

This paper presents a PDF malware detection method that represents PDF objects in an intermediate representation (PDFObj IR) and analyzes it with language models, improving adversarial robustness while keeping the false positive rate at 0.07%.

Contribution

The paper proposes PDFObj IR (PDF Object Intermediate Representation), an assembly-like language framework for PDF objects, and extracts semantic and structural features from it using language models and graph analysis for robust PDF malware detection.

Findings

01

Achieves high adversarial robustness in PDF malware classification.

02

Maintains a false positive rate of only 0.07%.

03

Outperforms state-of-the-art classifiers on baseline datasets.
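To put the 0.07% figure in perspective, the short sketch below computes a false positive rate from confusion-matrix counts. The counts are hypothetical round numbers chosen only to reproduce a 0.07% rate. They are not taken from the paper's evaluation.

```python
def false_positive_rate(false_positives: int, true_negatives: int) -> float:
    """FPR = FP / (FP + TN): the share of benign files flagged as malicious."""
    return false_positives / (false_positives + true_negatives)


# Hypothetical counts (not from the paper): 10,000 benign PDFs, 7 flagged by mistake.
fpr = false_positive_rate(false_positives=7, true_negatives=9_993)
print(f"{fpr:.2%}")
```

At that rate, a scanner processing 10,000 benign PDFs would raise about 7 false alarms.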

Abstract

Malicious PDF files have emerged as a persistent threat and become a popular attack vector in web-based attacks. While machine learning-based PDF malware classifiers have shown promise, these classifiers are often susceptible to adversarial attacks, undermining their reliability. To address this issue, recent studies have aimed to enhance the robustness of PDF classifiers. Despite these efforts, the feature engineering underlying these studies remains outdated. Consequently, even with the application of cutting-edge machine learning techniques, these approaches fail to fundamentally resolve the issue of feature instability. To tackle this, we propose a novel approach for PDF feature extraction and PDF malware detection. We introduce the PDFObj IR (PDF Object Intermediate Representation), an assembly-like language framework for PDF objects, from which we extract semantic features using…
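As a rough illustration of what an assembly-like view of PDF objects might look like, the sketch below lowers a few toy object dictionaries into line-oriented pseudo-instructions and builds an object reference graph. Everything here is an assumption made for illustration: the opcode names (`OBJ`, `SET`, `REF`), the helper functions, and the toy objects are invented and are not the paper's PDFObj IR instruction set or pipeline. Only the `N G R` indirect-reference syntax is standard PDF.

```python
import re

# Standard PDF indirect reference syntax: "<object number> <generation> R".
REF = re.compile(r"(\d+) (\d+) R")


def lower_object(obj_num, obj_dict):
    """Lower one object dictionary to assembly-like lines (invented opcodes)."""
    lines = [f"OBJ {obj_num}"]
    for key, value in obj_dict.items():
        match = REF.fullmatch(value)
        if match:
            # The value points at another object: emit a reference instruction.
            lines.append(f"REF {key} -> {match.group(1)}")
        else:
            # The value is a plain name or literal: emit an assignment.
            lines.append(f"SET {key} {value}")
    return lines


def reference_graph(objects):
    """Map each object number to the sorted object numbers it references."""
    graph = {}
    for obj_num, obj_dict in objects.items():
        targets = set()
        for value in obj_dict.values():
            for match in REF.finditer(value):
                targets.add(int(match.group(1)))
        graph[obj_num] = sorted(targets)
    return graph


# Hypothetical toy document: a catalog whose open action runs JavaScript.
objects = {
    1: {"/Type": "/Catalog", "/OpenAction": "2 0 R"},
    2: {"/S": "/JavaScript", "/JS": "3 0 R"},
    3: {"/Length": "42"},
}

for num, body in objects.items():
    print("\n".join(lower_object(num, body)))
print(reference_graph(objects))
```

In a design along these lines, the per-object instruction text would be the kind of input a language model could embed, and the reference graph would supply the structural side of the features.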
