Semantic Preprocessing for LLM-based Malware Analysis

Benjamin Marais; Tony Quertier; Grégoire Barrue

arXiv:2506.12113·cs.CR·October 6, 2025

Open Access

TL;DR

This paper introduces a semantic preprocessing method for malware analysis that leverages expert knowledge to create interpretable JSON reports, enhancing AI model explainability and achieving high classification performance.

Contribution

It proposes a novel preprocessing approach that combines static and behavioral features with expert knowledge for improved malware semantic analysis.

Findings

01. Achieved a weighted-average F1-score of 0.94 on a complex dataset.

02. Enhanced AI model interpretability for malware classification.

03. Integrated expert knowledge into preprocessing for better feature representation.
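
For readers unfamiliar with the metric in the first finding, the weighted-average F1-score is the per-class F1 averaged with weights proportional to each class's number of true samples. The sketch below shows that computation in plain Python. The function name `weighted_f1` and the toy labels are illustrative assumptions, not taken from the paper, and the paper's own evaluation code may differ.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label in sorted(support):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
        score += f1 * support[label] / total
    return score


# Hypothetical family labels, for illustration only.
y_true = ["family_a", "family_a", "family_a", "family_b"]
y_pred = ["family_a", "family_a", "family_b", "family_b"]
print(round(weighted_f1(y_true, y_pred), 4))
```

Because the weights follow class support, a large class dominates the score, which is worth keeping in mind when reading a single aggregate figure on an imbalanced malware dataset.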

Abstract

In the context of malware analysis, numerous approaches rely on Artificial Intelligence to handle large volumes of data. However, these techniques focus on a data view (images, sequences) rather than on an expert's view. To address this issue, we propose a preprocessing method that focuses on expert knowledge to improve malware semantic analysis and result interpretability. The method creates JSON reports for Portable Executable files. These reports gather features from both static and behavioral analysis, and incorporate packer signature detection, MITRE ATT&CK and Malware Behavior Catalog (MBC) knowledge. The purpose of this preprocessing is to build a semantic representation of binary files that is understandable by malware analysts and that can enhance AI models' explainability for malicious file analysis. Using this preprocessing to train a Large Language Model for…
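
To make the idea of a per-file JSON report more concrete, here is a minimal sketch of what such a report could look like when assembled in Python. The abstract does not give the actual schema, so every field name (`static_features`, `behavioral_features`, `packer_signatures`, `attack_techniques`, `mbc_behaviors`), the `PEReport` class, the `build_report` helper, and all sample values are assumptions made purely for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class PEReport:
    """Hypothetical report layout; the paper's real schema is not given in the abstract."""
    sha256: str
    static_features: dict = field(default_factory=dict)      # e.g. sections, imports
    behavioral_features: dict = field(default_factory=dict)  # e.g. sandbox observations
    packer_signatures: list = field(default_factory=list)    # matched packer names
    attack_techniques: list = field(default_factory=list)    # MITRE ATT&CK entries
    mbc_behaviors: list = field(default_factory=list)        # Malware Behavior Catalog entries


def build_report(report: PEReport) -> str:
    """Serialize the report as human-readable JSON text."""
    return json.dumps(asdict(report), indent=2, sort_keys=True)


# Placeholder values, not data from the paper.
example = PEReport(
    sha256="0" * 64,
    static_features={"sections": [".text", ".rsrc"], "imports": ["VirtualAllocEx"]},
    behavioral_features={"processes_created": 1},
    packer_signatures=["UPX"],
    attack_techniques=[{"id": "T1055", "name": "Process Injection"}],
    mbc_behaviors=[{"id": "B0001", "name": "Debugger Detection"}],
)
report_text = build_report(example)
print(report_text)
```

A text report of this kind can be read by an analyst and can also be fed directly to a language model as input, which is the dual use the abstract describes.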

Taxonomy

Topics: Advanced Malware Detection Techniques · Digital and Cyber Forensics · Anomaly Detection Techniques and Applications

MethodsFocus