A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts

Richard J. Young; Gregory D. Moody

arXiv:2605.03179·cs.CR·May 6, 2026

TL;DR

This paper introduces a weapons-versus-knowledge classification axis for malicious-coding prompts, separating requests for executable malicious software from requests for harmful security knowledge, so that refusal evaluations of language models can measure each request type on its own.

Contribution

It presents a consensus protocol in which five large-language-model judges categorize 3,133 prompts drawn from four public benchmarks, producing a validated prompt bank for code safety assessment.

Findings

01

Achieved almost perfect inter-rater agreement, with a Fleiss' kappa of 0.876.

02

Produced a consensus-CODE bank of 1,554 prompts (the primary released artifact) and a 388-prompt consensus-KNOWLEDGE comparison set.

03

Demonstrated high reliability in weapons-vs-knowledge classification across the five judge models.
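
For readers unfamiliar with the agreement statistic cited in the findings, the following is a minimal sketch of how Fleiss' kappa is computed from per-prompt judge labels. It follows the standard textbook definition, not the authors' code; the function name `fleiss_kappa` and the toy ratings are illustrative assumptions, and only the label names CODE and KNOWLEDGE come from the listing.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per rater).

    Every item must be rated by the same number of raters (at least two).
    Standard definition; not taken from the paper's implementation.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in ratings]
    categories = {label for item in ratings for label in item}

    # Observed agreement: mean over items of the fraction of agreeing rater pairs.
    per_item = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    total = n_items * n_raters
    p_expected = sum((sum(cnt[c] for cnt in counts) / total) ** 2 for c in categories)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Toy data (hypothetical): four prompts, five judges each.
    toy = [
        ["CODE"] * 5,
        ["KNOWLEDGE"] * 5,
        ["CODE"] * 4 + ["KNOWLEDGE"],
        ["KNOWLEDGE"] * 5,
    ]
    print(round(fleiss_kappa(toy), 3))
```

Under the commonly used Landis and Koch bands, values above 0.80 are described as "almost perfect", which is the wording the finding applies to 0.876.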

Abstract

Existing benchmarks of language-model refusal on malicious-coding tasks routinely conflate requests for executable malicious software with requests for harmful security knowledge. This conflation matters because the two request types plausibly trigger distinct refusal pathways in safety-aligned language models, and a single refusal-rate statistic computed over a mixture cannot isolate either. This paper introduces a weapons-versus-knowledge classification axis, operationalized through a five-model consensus protocol, and applies it to 3,133 prompts drawn from four public benchmarks, yielding a 1,554-prompt consensus-CODE bank (the primary released artifact) and a 388-prompt consensus-KNOWLEDGE comparison set used by the companion benchmark paper. The consensus pipeline uses five large-language-model judges spanning five vendor families (Anthropic, OpenAI, Google, Zhipu AI, Alibaba),…
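
The abstract is truncated before it states the exact consensus rule, so the following is only a rough sketch of how a five-judge consensus filter could partition prompts. The agreement threshold `min_agree`, the function names, and the example votes are all hypothetical; only the label names CODE and KNOWLEDGE and the five-judge setup come from the abstract.

```python
from collections import Counter


def consensus_label(votes, min_agree=4):
    """Return the label at least `min_agree` judges agree on, else None.

    `min_agree` is an assumed threshold; the paper's actual rule is not
    visible in the truncated abstract.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agree else None


def build_banks(judged_prompts, min_agree=4):
    """Group prompt ids by consensus label, dropping prompts without consensus."""
    banks = {}
    for prompt_id, votes in judged_prompts.items():
        label = consensus_label(votes, min_agree)
        if label is not None:
            banks.setdefault(label, []).append(prompt_id)
    return banks


if __name__ == "__main__":
    # Hypothetical votes from five judges for three prompt ids.
    judged = {
        "p1": ["CODE", "CODE", "CODE", "CODE", "CODE"],
        "p2": ["KNOWLEDGE", "KNOWLEDGE", "KNOWLEDGE", "KNOWLEDGE", "CODE"],
        "p3": ["CODE", "CODE", "CODE", "KNOWLEDGE", "KNOWLEDGE"],
    }
    print(build_banks(judged))
```

A filter of this kind would explain why the two released sets (1,554 and 388 prompts) together are smaller than the 3,133 source prompts: prompts that do not reach consensus on either label are left out.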
