PDFInspect: A Unified Feature Extraction Framework for Malicious Document Detection
Sharmila S P

TL;DR
This paper introduces PDFInspect, a comprehensive feature extraction framework that combines graph analysis, metadata, and structural features to improve malicious PDF detection and analysis.
Contribution
It presents a unified, scalable framework integrating multiple feature types for enhanced detection of malicious PDFs, supporting real-world threat intelligence workflows.
Findings
High-dimensional feature vectors improve malware classification accuracy.
The framework effectively captures document complexity and behavioral signatures.
It supports anomaly detection and forensic analysis of PDFs.
Abstract
The increasing prevalence of malicious Portable Document Format (PDF) files necessitates robust and comprehensive feature extraction techniques for effective detection and analysis. This work presents a unified framework that integrates graph-based, structural, and metadata-driven analysis to generate a rich feature representation for each PDF document. The system extracts text from PDF pages and constructs undirected graphs based on pairwise word relationships, enabling the computation of graph-theoretic features such as node count, edge density, and clustering coefficient. Simultaneously, the framework parses embedded metadata to quantify character distributions, entropy patterns, and inconsistencies across fields such as author, title, and producer. Temporal features are derived from creation and modification timestamps to capture behavioral signatures, while structural elements…
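The graph-theoretic and entropy features named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. The abstract does not say how "pairwise word relationships" are defined, so the sketch assumes that two words are linked when they appear next to each other in the extracted text. The function names (`build_word_graph`, `graph_features`, `shannon_entropy`) are hypothetical, and only the Python standard library is used.

```python
import math
from collections import Counter
from itertools import combinations


def build_word_graph(text):
    """Build an undirected graph as an adjacency dict.

    Assumption: an edge joins every pair of adjacent words; the paper's
    actual pairing rule is not given in the abstract.
    """
    words = text.lower().split()
    adj = {w: set() for w in words}
    for a, b in zip(words, words[1:]):
        if a != b:  # ignore self-loops from repeated words
            adj[a].add(b)
            adj[b].add(a)
    return adj


def graph_features(adj):
    """Return node count, edge count, edge density, and average clustering."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2  # each edge counted twice
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0

    coeffs = []
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)  # clustering is taken as 0 for degree < 2
            continue
        # count edges that exist between the neighbours of this node
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        coeffs.append(2 * links / (k * (k - 1)))
    avg_clustering = sum(coeffs) / n if n else 0.0

    return {
        "node_count": n,
        "edge_count": m,
        "edge_density": density,
        "avg_clustering": avg_clustering,
    }


def shannon_entropy(value):
    """Character-level Shannon entropy (bits) of a metadata string."""
    if not value:
        return 0.0
    total = len(value)
    counts = Counter(value)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

In a pipeline of this kind, `graph_features(build_word_graph(page_text))` would be computed per document, and `shannon_entropy` would be applied to each metadata field (author, title, producer) before the values are concatenated into one feature vector. Temporal and structural features are not covered by this sketch.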
Taxonomy
Topics: Advanced Malware Detection Techniques · Spam and Phishing Detection · Authorship Attribution and Profiling
