SBAN: A Framework & Multi-Dimensional Dataset for Large Language Model Pre-Training and Software Code Mining
Hamed Jelodar, Mohammad Meymani, Samita Bai, Roozbeh Razavi-Far, and Ali A. Ghorbani

TL;DR
SBAN is a multimodal dataset of more than 3 million samples spanning source code, binary, assembly, and natural language descriptions, designed to support large language model pre-training and software analysis tasks.
Contribution
The paper introduces SBAN, a large-scale, multi-dimensional dataset that bridges low-level code representations and high-level semantics for advanced software code analysis.
Findings
Enables cross-representation learning and semantic understanding of software.
Supports malware detection, code translation, and explanation tasks.
Facilitates scalable training of deep language models for software mining.
Abstract
This paper introduces SBAN (Source code, Binary, Assembly, and Natural Language Description), a large-scale, multi-dimensional dataset designed to advance the pre-training and evaluation of large language models (LLMs) for software code analysis. SBAN comprises more than 3 million samples, including 2.9 million benign samples and 672,000 malware samples, each represented across four complementary layers: binary code, assembly instructions, natural language descriptions, and source code. This multimodal structure enables research on cross-representation learning, semantic understanding of software, and automated malware detection. Beyond security applications, SBAN supports broader tasks such as code translation, code explanation, and other software mining tasks involving heterogeneous data. It is particularly suited for scalable training of deep models, including transformers and…
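The four-layer structure described in the abstract can be pictured as one record per sample. The sketch below is illustrative only: the class name `SbanSample`, its field names, the `is_malware` flag, and the helper `translation_pairs` are all hypothetical, since the abstract names the four layers and the benign/malware split but does not specify a schema or file format. It shows how one such record could be expanded into input/target pairs for cross-representation tasks such as code translation or code explanation.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Iterator, Tuple, Union

# Hypothetical layer names; the abstract lists the four layers but no schema.
LAYERS = ("source_code", "binary", "assembly", "description")

Layer = Union[str, bytes]


@dataclass(frozen=True)
class SbanSample:
    """One sample represented across the four layers (assumed layout)."""
    source_code: str
    binary: bytes
    assembly: str
    description: str
    is_malware: bool  # assumed encoding of the benign/malware split


def translation_pairs(sample: SbanSample) -> Iterator[Tuple[str, str, Layer, Layer]]:
    """Yield (input_name, target_name, input_value, target_value) for every
    ordered pair of distinct layers."""
    for src, tgt in permutations(LAYERS, 2):
        yield src, tgt, getattr(sample, src), getattr(sample, tgt)


# Toy values, not taken from the dataset.
sample = SbanSample(
    source_code="int main(void) { return 0; }",
    binary=b"\x31\xc0\xc3",
    assembly="xor eax, eax\nret",
    description="Program that exits immediately with status 0.",
    is_malware=False,
)

pairs = list(translation_pairs(sample))
# 4 layers give 4 * 3 = 12 ordered (input, target) pairs.
explain = [p for p in pairs if p[0] == "assembly" and p[1] == "description"]
```

Under these assumptions, the `assembly` to `description` pair corresponds to a code explanation task, while `is_malware` would serve as the label for malware detection.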
