LCC-LLM: Leveraging Code-Centric Large Language Models for Malware Attribution
Christopher G. Pedraza Pohlenz, Hassan Jalil Hadi, Ali Hassan, Ali Shoker

TL;DR
This paper introduces LCC-LLM, a code-centric framework with a large dataset for malware attribution, enhancing LLM reliability through evidence grounding and multi-task static analysis.
Contribution
It presents a new dataset and a multi-layered framework that improves malware attribution accuracy and reliability using code representations and retrieval-augmented reasoning.
Findings
The framework achieves an average semantic similarity of 0.634 across 43 malware tasks.
The grounded pipeline passes 10/10 in structured analysis of MalwareBazaar samples.
Evidence grounding improves the factual reliability of LLM-based malware analysis and its usefulness for decision support.
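This summary does not say how the semantic similarity score is computed. One common choice, assumed here purely for illustration, is cosine similarity between embedding vectors of a generated answer and a reference answer, averaged over tasks. The function names and the plain-list vector format below are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Treat a zero vector as having no similarity to anything.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_task_similarity(pairs):
    """Average cosine similarity over (generated, reference) vector pairs.

    Assumption: each task contributes one pair of embedding vectors.
    """
    scores = [cosine_similarity(gen, ref) for gen, ref in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this assumption, a reported average such as 0.634 would be the output of `mean_task_similarity` over one pair per task, with the embeddings produced by whatever sentence-embedding model the evaluation uses.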
Abstract
LLMs are increasingly explored for malware analysis; however, current LLM-based malware attribution remains limited by unsupported indicators and insufficient code-level grounding for identifying malicious and vulnerable code segments. To address these limitations, this research introduces LCC-LLM, a code-centric benchmark dataset and evidence-grounded framework for malware attribution and multi-task static malware analysis. The proposed LCCD dataset contains approximately 34K PE samples processed through a large-scale reverse-engineering pipeline and represented using decompiled C code, assembly code, CFG/FCG artifacts, hexadecimal data, PE metadata, suspicious API evidence, and structural features. Beyond dataset construction, LCC-LLM integrates LangGraph-orchestrated static analysis with multi-source cybersecurity knowledge to support evidence-grounded malware reasoning. The…
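To make the idea of evidence grounding concrete, the following minimal sketch models one sample as a record holding a few of the representations listed in the abstract, then separates the API indicators an LLM claims into those backed by extracted static evidence and those that are unsupported. The `SampleRecord` class, its field names, and `split_grounded` are hypothetical illustrations, not the LCCD schema or the LCC-LLM implementation.

```python
from dataclasses import dataclass, field


@dataclass
class SampleRecord:
    """Hypothetical per-sample record; field names are assumptions."""
    sha256: str
    decompiled_c: str = ""
    assembly: str = ""
    pe_metadata: dict = field(default_factory=dict)
    suspicious_apis: set = field(default_factory=set)


def split_grounded(claimed_apis, record):
    """Split LLM-claimed API indicators by whether static evidence backs them.

    Returns (grounded, unsupported), each preserving the claimed order.
    """
    grounded = [api for api in claimed_apis if api in record.suspicious_apis]
    unsupported = [api for api in claimed_apis if api not in record.suspicious_apis]
    return grounded, unsupported


# Usage: only indicators present in the extracted evidence count as grounded.
record = SampleRecord(
    sha256="<sample hash>",
    suspicious_apis={"VirtualAllocEx", "WriteProcessMemory"},
)
grounded, unsupported = split_grounded(
    ["VirtualAllocEx", "CreateRemoteThread"], record
)
```

In a pipeline of this shape, unsupported indicators could be dropped or flagged for review before they reach the final attribution report, which is one way to address the "unsupported indicators" problem the abstract describes.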
