Surgical Repair of Insecure Code Generation in LLMs

Gustavo Sandoval; Brendan Dolan-Gavitt; Siddharth Garg

arXiv:2604.16697 · cs.CR · April 21, 2026

TL;DR

This paper investigates why large language models generate insecure code even though they can identify the same vulnerabilities when asked directly. It traces the failure to a single layer and proposes a targeted steering fix that reduces insecure outputs by up to 74%.

Contribution

It identifies the cause of insecure code generation as a layer-specific encoding problem and introduces a steering method that reduces vulnerabilities across multiple models and architectures.

Findings

01

Per-vulnerability steering vectors reduce insecure code generation by up to 74%.

02

Security representations are encoded early but only become active at the final layer.

03

The results suggest insecure code generation is an interpretability problem, not a training artifact.

Abstract

Large language models write production code, and yet they routinely introduce well-known vulnerabilities. We show that this is not a knowledge deficit: the same models that generate insecure code correctly identify and explain the vulnerability when asked directly, a gap we call the Format-Reliability Gap. Mechanistic analysis reveals the cause: security representations are encoded from the earliest layers but remain computationally inert until the final layer, where format-compliance demands compete with them. Because the failure is localized to a single layer, per-vulnerability steering vectors reduce insecure generation by up to 74% with negligible overhead. The mechanism and the fix generalize across five models, three architecture families, and six vulnerability types, suggesting insecure code generation is an interpretability problem, not a training artifact.
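
The abstract does not spell out how the per-vulnerability steering vectors are built. As a rough illustration only, the sketch below uses a common activation-steering recipe, difference of means, which may or may not match the authors' method. It assumes we have final-layer activations collected on secure and insecure completions for one vulnerability class. The function names, the scaling factor `alpha`, and the toy numbers are all hypothetical.

```python
from statistics import fmean

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]

def steering_vector(secure_acts, insecure_acts):
    """Difference of means: points from the insecure cluster toward the secure one.

    Assumption: this is a generic recipe, not necessarily the paper's.
    """
    mu_secure = mean_vector(secure_acts)
    mu_insecure = mean_vector(insecure_acts)
    return [s - i for s, i in zip(mu_secure, mu_insecure)]

def steer(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to a single final-layer hidden state."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

# Toy 3-dimensional activations (made-up numbers, not taken from the paper).
secure = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
insecure = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

# One vector per vulnerability class; this one stands in for a single class.
v_example = steering_vector(secure, insecure)
steered = steer([0.5, 0.5, 0.5], v_example, alpha=0.5)

print(v_example)  # [2.0, -2.0, 0.0]
print(steered)    # [1.5, -0.5, 0.5]
```

In a real model the same addition would be applied to the hidden state at the chosen layer during generation, for example through a forward hook. Because it is a single vector addition at one layer, the extra cost is small, which is consistent with the "negligible overhead" the abstract reports.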
