An Empirical Security Evaluation of LLM-Generated Cryptographic Rust Code
Mohamed Elsayed, Kenneth Fulton, Jeong Yang

TL;DR
This study empirically evaluates the security of 240 LLM-generated Rust cryptographic code samples. It finds a low compilation success rate and vulnerabilities in the code that does compile, highlighting the limitations of current code generation and analysis tools.
Contribution
The paper provides the first large-scale empirical analysis of LLM-generated cryptographic Rust code, demonstrating significant vulnerabilities and the inadequacy of general-purpose static analysis tools.
Findings
Only 23.3% of the generated code samples compiled successfully.
The authors' rule-based crypto-specific analyzer found vulnerabilities in 57% of the compiled samples.
Prompt strategy significantly affects code quality, with chain-of-thought prompting performing worse.
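To put the percentages above in absolute terms, the following is a minimal sketch of the arithmetic, not code from the paper. It assumes the 240 samples come from 2 algorithms, 3 models, and 4 prompt strategies as described in the abstract, and that the samples are split evenly across those configurations (an assumption; the summary does not state the split). The function names are hypothetical.

```rust
/// Number of experimental configurations: algorithms x models x prompt strategies.
fn configurations(algorithms: u32, models: u32, strategies: u32) -> u32 {
    algorithms * models * strategies
}

/// Sample count implied by a reported percentage, rounded to the nearest integer.
fn implied_count(total: u32, percent: f64) -> u32 {
    (total as f64 * percent / 100.0).round() as u32
}

fn main() {
    let total = 240;
    let configs = configurations(2, 3, 4);
    println!("configurations: {configs}");
    // Assumes an even split, which the summary does not confirm.
    println!("samples per configuration: {}", total / configs);
    println!("samples implied by 23.3% compiling: {}", implied_count(total, 23.3));
}
```

Under these assumptions there are 24 configurations with 10 samples each, and a 23.3% compilation rate corresponds to roughly 56 of the 240 samples.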
Abstract
Developers and organizations are using Large Language Models (LLMs) to generate security-critical code more frequently than ever, including cryptographic solutions for their products. This study presents an empirical evaluation of cryptographic security in 240 Rust code samples for two cryptographic algorithms (AES-256-GCM and ChaCha20-Poly1305) generated by three LLMs (Gemini 2.5 Pro, GPT-4o, and DeepSeek Coder) using four different prompt strategies. For each code sample that compiled successfully, CodeQL static analysis and our rule-based crypto-specific analyzer were used to detect vulnerabilities, which were also mapped to Common Weakness Enumeration (CWE) entries. The evaluation results revealed that only 23.3% of the generated code samples compiled successfully. Among the compiled code, CodeQL produced only two false positives, while our rule-based crypto-specific analyzer identified…
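The abstract does not describe the rules used by the authors' analyzer. As a rough illustration of what a rule-based, crypto-specific check mapped to CWE identifiers could look like, here is a toy sketch that flags constant all-zero buffers in Rust source text. The `Finding` type, the `RULES` table, and the `scan` function are all hypothetical and are not the paper's implementation. The CWE labels (CWE-323 for nonce reuse, CWE-321 for a hard-coded cryptographic key) are chosen for illustration only.

```rust
/// One finding reported by the toy scanner (hypothetical type).
#[derive(Debug, PartialEq)]
struct Finding {
    line: usize,
    cwe: &'static str,
    message: &'static str,
}

/// Hypothetical rules: (substring to look for, CWE id, message).
/// Plain substring matching is crude and will produce false positives.
const RULES: &[(&str, &str, &str)] = &[
    ("[0u8; 12]", "CWE-323", "constant all-zero 12-byte buffer, possibly a fixed nonce"),
    ("[0u8; 32]", "CWE-321", "constant all-zero 32-byte buffer, possibly a hard-coded key"),
];

/// Scan source text line by line and report every rule that matches.
fn scan(source: &str) -> Vec<Finding> {
    let mut findings = Vec::new();
    for (idx, text) in source.lines().enumerate() {
        // Skip lines that are entirely a `//` comment.
        if text.trim_start().starts_with("//") {
            continue;
        }
        for &(pattern, cwe, message) in RULES {
            if text.contains(pattern) {
                findings.push(Finding { line: idx + 1, cwe, message });
            }
        }
    }
    findings
}

fn main() {
    let sample = "let key = [0u8; 32];\nlet nonce = [0u8; 12];\n// let n = [0u8; 12];\n";
    for f in scan(sample) {
        println!("line {}: {} ({})", f.line, f.cwe, f.message);
    }
}
```

A real analyzer would more plausibly work on the syntax tree or on data flow rather than on substrings, since a zeroed 32-byte array may simply be an output buffer. The sketch only shows the general shape of a pattern rule paired with a CWE label.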
