LCC-LLM: Leveraging Code-Centric Large Language Models for Malware Attribution

Christopher G. Pedraza Pohlenz; Hassan Jalil Hadi; Ali Hassan; Ali Shoker

arXiv:2605.05807 · cs.CR · May 8, 2026

TL;DR

This paper introduces LCC-LLM, which pairs a code-centric dataset of roughly 34K PE samples (LCCD) with an evidence-grounded framework for malware attribution and multi-task static analysis, with the goal of making LLM-based analysis more reliable.

Contribution

The paper contributes a new dataset and a multi-layered framework that uses code representations and retrieval-augmented reasoning to improve the accuracy and reliability of malware attribution.

Findings

01. Achieved an average semantic similarity of 0.634 across 43 malware tasks.

02. Grounded pipeline passes 10/10 in structured analysis for MalwareBazaar samples.

03. Improved factual reliability and decision support in malware analysis using LLMs.
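
The first finding reports a semantic similarity averaged over tasks. This summary does not say how the similarity is computed, so the following is only a minimal sketch under stated assumptions: similarity is taken to be the cosine between an embedding of the generated answer and an embedding of a reference answer, and the overall score is a macro average (mean of per-task means). The function names, task names, and vectors are hypothetical and are not taken from the paper.

```python
import math
from statistics import mean


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def average_task_similarity(pairs_by_task):
    """Return (macro average, per-task means) over (generated, reference) pairs."""
    per_task = {
        task: mean(cosine_similarity(g, r) for g, r in pairs)
        for task, pairs in pairs_by_task.items()
    }
    return mean(per_task.values()), per_task


# Toy, made-up embeddings (assumption): not data from the paper.
toy = {
    "family_attribution": [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])],
    "api_summary": [([1.0, 1.0], [1.0, 1.0])],
}
overall, per_task = average_task_similarity(toy)
print(round(overall, 3), {k: round(v, 3) for k, v in per_task.items()})
```

A macro average weights every task equally regardless of how many samples it has. A micro average over all pairs would be an equally plausible reading of "average across 43 tasks", and the two can differ noticeably when task sizes are uneven.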

Abstract

LLMs are increasingly explored for malware analysis; however, current LLM-based malware attribution remains limited by unsupported indicators and insufficient code-level grounding for identifying malicious and vulnerable code segments. To address these limitations, this research introduces LCC-LLM, a code-centric benchmark dataset and evidence-grounded framework for malware attribution and multi-task static malware analysis. The proposed LCCD dataset contains approximately 34K PE samples processed through a large-scale reverse-engineering pipeline and represented using decompiled C code, assembly code, CFG/FCG artifacts, hexadecimal data, PE metadata, suspicious API evidence, and structural features. Beyond dataset construction, LCC-LLM integrates LangGraph-orchestrated static analysis with multi-source cybersecurity knowledge to support evidence-grounded malware reasoning. The…
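
The abstract names two ideas that a small example can make concrete: a per-sample record holding several code-level representations, and the notion of an "unsupported indicator". The sketch below is illustrative only. The `SampleRecord` fields, the `split_indicators` helper, and the literal-match rule are assumptions made for this example; they are not the LCCD schema or the paper's grounding method.

```python
from dataclasses import dataclass, field


@dataclass
class SampleRecord:
    """Hypothetical per-sample record; field names are illustrative only."""
    sha256: str
    decompiled_c: str = ""
    assembly: str = ""
    pe_metadata: dict = field(default_factory=dict)
    suspicious_apis: list = field(default_factory=list)


def split_indicators(record, claimed_apis):
    """Partition claimed API indicators into (supported, unsupported).

    Assumed rule: an indicator is supported only if it appears in the
    extracted suspicious-API list or literally in the decompiled C or
    assembly text (case-insensitive substring match).
    """
    known = {api.lower() for api in record.suspicious_apis}
    text = (record.decompiled_c + "\n" + record.assembly).lower()
    supported, unsupported = [], []
    for api in claimed_apis:
        key = api.lower()
        if key in known or key in text:
            supported.append(api)
        else:
            unsupported.append(api)
    return supported, unsupported


record = SampleRecord(
    sha256="0" * 64,  # placeholder hash, not a real sample
    decompiled_c="h = CreateRemoteThread(proc, 0, 0, start, arg, 0, 0);",
    suspicious_apis=["VirtualAllocEx", "WriteProcessMemory"],
)
# Pretend these API names came from a model's answer.
supported, unsupported = split_indicators(
    record, ["CreateRemoteThread", "VirtualAllocEx", "InternetOpenUrlA"]
)
print(supported, unsupported)
```

Here `InternetOpenUrlA` is flagged as unsupported because nothing in the record mentions it. A substring match is deliberately crude: it would also accept a name that only appears inside a longer identifier or a comment, so a real check would likely work on parsed call sites rather than raw text.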
