PDFInspect: A Unified Feature Extraction Framework for Malicious Document Detection

Sharmila S P

arXiv:2601.12866 · cs.CR · January 21, 2026

Open Access

TL;DR

This paper introduces PDFInspect, a comprehensive feature extraction framework that combines graph analysis, metadata, and structural features to improve malicious PDF detection and analysis.

Contribution

It presents a unified, scalable framework integrating multiple feature types for enhanced detection of malicious PDFs, supporting real-world threat intelligence workflows.

Findings

01. High-dimensional feature vectors improve malware classification accuracy.

02. Framework effectively captures document complexity and behavioral signatures.

03. Supports anomaly detection and forensic analysis of PDFs.

Abstract

The increasing prevalence of malicious Portable Document Format (PDF) files necessitates robust and comprehensive feature extraction techniques for effective detection and analysis. This work presents a unified framework that integrates graph-based, structural, and metadata-driven analysis to generate a rich feature representation for each PDF document. The system extracts text from PDF pages and constructs undirected graphs based on pairwise word relationships, enabling the computation of graph-theoretic features such as node count, edge density, and clustering coefficient. Simultaneously, the framework parses embedded metadata to quantify character distributions, entropy patterns, and inconsistencies across fields such as author, title, and producer. Temporal features are derived from creation and modification timestamps to capture behavioral signatures, while structural elements…
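
As a rough illustration of the graph-based and metadata features the abstract names, the sketch below computes node count, edge density, and average clustering coefficient for a word graph, plus the Shannon entropy of a metadata string. It is not the paper's implementation. The abstract does not say how "pairwise word relationships" are defined, so linking words that appear within a small sliding window is an assumption, and the function names `build_word_graph`, `graph_features`, and `shannon_entropy` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def build_word_graph(text, window=2):
    """Build an undirected word graph as an adjacency dict.

    Assumption: two distinct words are linked when they occur within
    `window` tokens of each other (window=2 links adjacent words only).
    """
    tokens = text.lower().split()
    adj = {t: set() for t in tokens}
    for i, w in enumerate(tokens):
        for v in tokens[i + 1:i + window]:
            if v != w:  # skip self-loops
                adj[w].add(v)
                adj[v].add(w)
    return adj


def graph_features(adj):
    """Return node count, edge count, edge density, and average clustering."""
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) // 2  # each edge is stored twice
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0

    total = 0.0
    for nb in adj.values():
        k = len(nb)
        if k < 2:
            continue  # local clustering is taken as 0 for degree < 2
        # Count edges that exist among this node's neighbours.
        links = sum(1 for a, b in combinations(nb, 2) if b in adj[a])
        total += 2 * links / (k * (k - 1))
    avg_clustering = total / n if n else 0.0

    return {
        "node_count": n,
        "edge_count": m,
        "edge_density": density,
        "avg_clustering": avg_clustering,
    }


def shannon_entropy(value):
    """Character-level Shannon entropy (bits) of a metadata field value."""
    if not value:
        return 0.0
    counts = Counter(value)
    n = len(value)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

In a sketch like this, the page text would come from whatever PDF text extractor is in use, and `shannon_entropy` would be applied to each metadata field (author, title, producer) separately. The resulting values could then be concatenated into a single feature vector per document.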

Taxonomy

Topics: Advanced Malware Detection Techniques · Spam and Phishing Detection · Authorship Attribution and Profiling