SBAN: A Framework & Multi-Dimensional Dataset for Large Language Model Pre-Training and Software Code Mining

Hamed Jelodar, Mohammad Meymani, Samita Bai, Roozbeh Razavi-Far, and Ali A. Ghorbani

arXiv:2510.18936 · cs.IR · October 28, 2025

TL;DR

SBAN is a comprehensive multi-modal dataset with over 3 million samples across code, binary, assembly, and natural language, designed to enhance large language model pre-training and software analysis tasks.

Contribution

The paper introduces SBAN, a large-scale, multi-dimensional dataset that bridges low-level code representations and high-level semantics for advanced software code analysis.

Findings

01 Enables cross-representation learning and semantic understanding of software.

02 Supports malware detection, code translation, and explanation tasks.

03 Facilitates scalable training of deep language models for software mining.

Abstract

This paper introduces SBAN (Source code, Binary, Assembly, and Natural Language Description), a large-scale, multi-dimensional dataset designed to advance the pre-training and evaluation of large language models (LLMs) for software code analysis. SBAN comprises more than 3 million samples, including 2.9 million benign and 672,000 malware samples, each represented across four complementary layers: binary code, assembly instructions, natural language descriptions, and source code. This multimodal structure enables research on cross-representation learning, semantic understanding of software, and automated malware detection. Beyond security applications, SBAN supports broader tasks such as code translation, code explanation, and other software mining tasks involving heterogeneous data. It is particularly suited for scalable training of deep models, including transformers and…
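To make the four-layer structure described in the abstract concrete, the following is a minimal sketch of how a single sample might be modeled and how the stated class counts combine. The class name `SbanSample`, its field names, and the helper functions are hypothetical and chosen for illustration; they are not the dataset's actual schema or API. The counts are the rounded figures quoted in the abstract.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass
class SbanSample:
    """Hypothetical record layout; field names are illustrative only."""
    sample_id: str
    label: str          # "benign" or "malware"
    binary: bytes       # binary code layer
    assembly: str       # assembly instruction layer
    description: str    # natural language description layer
    source: str         # source code layer


def malware_fraction(n_benign: int, n_malware: int) -> float:
    """Share of malware samples in the combined dataset."""
    return n_malware / (n_benign + n_malware)


def assembly_description_pairs(
    samples: Iterable[SbanSample],
) -> Iterator[Tuple[str, str]]:
    """Yield (assembly, description) pairs, e.g. for a code explanation task."""
    for s in samples:
        yield s.assembly, s.description


# Rounded counts as quoted in the abstract.
N_BENIGN = 2_900_000
N_MALWARE = 672_000
TOTAL = N_BENIGN + N_MALWARE
```

With these rounded figures the total is about 3.57 million samples, consistent with "more than 3 million", and malware makes up a little under one fifth of the data, which is worth keeping in mind when building balanced training splits.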
