Broken by Default: A Formal Verification Study of Security Vulnerabilities in AI-Generated Code

Dominik Blain; Maxime Noiseux

arXiv:2604.05292·cs.CR·April 9, 2026

Broken by Default: A Formal Verification Study of Security Vulnerabilities in AI-Generated Code

Dominik Blain, Maxime Noiseux

PDF

TL;DR

This study formally verifies 3,500 AI-generated code artifacts for security vulnerabilities using Z3 SMT solver, revealing most contain exploitable flaws and current tools miss most issues.

Contribution

It introduces a formal verification pipeline for AI-generated code, providing quantitative analysis of vulnerabilities across multiple models and benchmarks.

Findings

01

55.8% of artifacts contain vulnerabilities identified by COBALT.

02

No AI model achieved a grade better than D in security robustness.

03

Industry tools miss 97.8% of Z3-proven vulnerabilities.

Abstract

AI coding assistants are now used to generate production code in security-sensitive domains, yet the exploitability of their outputs remains unquantified. We address this gap with Broken by Default: a formal verification study of 3,500 code artifacts generated by seven widely-deployed LLMs across 500 security-critical prompts (five CWE categories, 100 prompts each). Each artifact is subjected to the Z3 SMT solver via the COBALT analysis pipeline, producing mathematical satisfiability witnesses rather than pattern-based heuristics. Across all models, 55.8% of artifacts contain at least one COBALT-identified vulnerability; of these, 1,055 are formally proven via Z3 satisfiability witnesses. GPT-4o leads at 62.4% (grade F); Gemini 2.5 Flash performs best at 48.4% (grade D). No model achieves a grade better than D. Six of seven representative findings are confirmed…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.