Shellcode_IA32: A Dataset for Automatic Shellcode Generation
Pietro Liguori, Erfan Al-Hossami, Domenico Cotroneo, Roberto Natella, Bojan Cukic, Samira Shaikh

TL;DR
This paper introduces Shellcode_IA32, a new dataset linking assembly instructions with natural language descriptions to facilitate automatic shellcode generation using neural machine translation techniques.
Contribution
It provides the first dataset of its kind for shellcode generation and establishes baseline performance with standard NMT methods.
Findings
Baseline performance levels established for shellcode generation
Dataset includes challenging and common assembly instructions
First dataset linking assembly code with natural language descriptions
Abstract
We take the first step to address the task of automatically generating shellcodes, i.e., small pieces of code used as a payload in the exploitation of a software vulnerability, starting from natural language comments. We assemble and release a novel dataset (Shellcode_IA32), consisting of challenging but common assembly instructions with their natural language descriptions. We experiment with standard methods in neural machine translation (NMT) to establish baseline performance levels on this task.
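To make the task concrete, the following is a minimal Python sketch of how description/instruction pairs might be represented and tokenized before being fed to a sequence-to-sequence model. The `Pair` class, the `tokenize` and `build_vocab` helpers, and the two example pairs are all hypothetical and written for illustration only; they are not taken from Shellcode_IA32 and do not reflect the authors' preprocessing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    """One training example: a description and the instruction it describes."""
    intent: str   # natural language description (source side)
    snippet: str  # IA-32 assembly instruction (target side)


# Hypothetical examples written for illustration; NOT copied from the dataset.
PAIRS = [
    Pair("move the value 1 into eax", "mov eax, 1"),
    Pair("push ebx onto the stack", "push ebx"),
]


def tokenize(text: str) -> list[str]:
    """Whitespace tokenizer that keeps commas as separate tokens (an assumption)."""
    return text.replace(",", " , ").split()


def build_vocab(sequences: list[list[str]]) -> dict[str, int]:
    """Map each token to an integer id, reserving ids for padding and unknowns."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab


src = [tokenize(p.intent) for p in PAIRS]
tgt = [tokenize(p.snippet) for p in PAIRS]
src_vocab = build_vocab(src)
tgt_vocab = build_vocab(tgt)
```

Separate source and target vocabularies are used here because English words and assembly tokens overlap very little; a shared vocabulary would be an equally reasonable choice.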
Taxonomy
Methods: Sigmoid Activation · Tanh Activation · Adam · Sequence to Sequence · Long Short-Term Memory · Bidirectional LSTM
