Automated Creation of Source Code Variants of a Cryptographic Hash Function Implementation Using Generative Pre-Trained Transformer Models

Elijah Pelofske; Vincent Urias; Lorie M. Liebrock

arXiv:2404.15681·cs.CR·July 11, 2024

TL;DR

This paper explores GPT models' ability to generate, analyze, and cluster numerous variants of SHA-1 hash function implementations, revealing both security risks and the potential for automated code diversification.

Contribution

It demonstrates the use of GPT models with context-aware prompting to produce a vast array of source code variants, including insecure and incorrect implementations, and introduces clustering techniques for code analysis.

Findings

1. Many generated variants are insecure or incorrect for some test vectors.
2. Over 100,000 function variants were clustered into equivalent groups.
3. Generated code includes serious flaws like memory leaks and overflows.

Abstract

Generative pre-trained transformers (GPTs) are a type of large language machine learning model that is unusually adept at producing novel and coherent natural language. This study examines the ability of GPT models to generate novel and correct versions, and notably very insecure versions, of implementations of the cryptographic hash function SHA-1. The GPT models Llama-2-70b-chat-h, Mistral-7B-Instruct-v0.1, and zephyr-7b-alpha are used. The models are prompted to re-write each function using a modified version of the localGPT framework and langchain, which provide word embedding context of the full source code and header files to the model. This resulted in over 150,000 function re-write GPT output text blocks, approximately 50,000 of which could be parsed as C code and subsequently compiled. The generated code is analyzed for being compilable, correctness of the…
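The correctness check mentioned in the abstract can be pictured as comparing each candidate implementation against a trusted reference on a set of inputs. The following is a minimal sketch of that idea, not the authors' harness: it uses Python's `hashlib` as the reference rather than compiled C code, and the names `TEST_INPUTS`, `failing_inputs`, `correct_variant`, and `broken_variant` are hypothetical, as are the test inputs themselves.

```python
import hashlib

# Hypothetical test inputs; the paper's actual test vectors are not given here.
TEST_INPUTS = [b"", b"abc", b"a" * 1000]


def failing_inputs(candidate, inputs=TEST_INPUTS):
    """Return the inputs on which `candidate` disagrees with the reference SHA-1."""
    return [m for m in inputs if candidate(m) != hashlib.sha1(m).hexdigest()]


def correct_variant(message: bytes) -> str:
    # Stand-in for a generated variant that matches the reference.
    return hashlib.sha1(message).hexdigest()


def broken_variant(message: bytes) -> str:
    # Stand-in for a flawed variant: it silently ignores everything after
    # the first 64 bytes, so it is only wrong on longer messages.
    return hashlib.sha1(message[:64]).hexdigest()


if __name__ == "__main__":
    print(len(failing_inputs(correct_variant)))  # no mismatches expected
    print(len(failing_inputs(broken_variant)))   # mismatch only on the long input
```

The deliberately broken variant passes the two short inputs and fails only the long one, which illustrates how an implementation can be incorrect for some test vectors while appearing correct on others.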

Taxonomy

Topics: Advanced Malware Detection Techniques · Scientific Computing and Data Management · Digital and Cyber Forensics

Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Linear Warmup With Cosine Annealing · Adam · Layer Normalization · Multi-Head Attention · Dropout · Attention Dropout