A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts
Richard J. Young, Gregory D. Moody

TL;DR
This paper introduces a reliable classification system for prompts related to malicious code, distinguishing between executable weapons and security knowledge, to improve safety evaluations of language models.
Contribution
It presents a consensus protocol involving multiple large-language-model judges to reliably categorize prompts, creating a validated prompt bank for code safety assessment.
Findings
Achieved almost perfect inter-rater agreement among the five judge models (Fleiss' kappa = 0.876).
Produced a consensus-labeled prompt bank of 1,554 prompts.
Demonstrated high reliability in weapons-vs-knowledge classification across the five judge models.
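For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of how Fleiss' kappa is computed for a fixed number of raters per item. It is not the authors' code: the function name `fleiss_kappa`, the label strings, and the four-prompt toy data are illustrative assumptions, and the paper's actual label set may differ.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items that were each rated by the same number of raters.

    ratings    -- list of per-item label lists, e.g. [["CODE", "CODE", ...], ...]
    categories -- every label a rater may assign
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in ratings]

    # Observed agreement: for each item, the share of rater pairs that agree.
    per_item = [
        (sum(c[k] ** 2 for k in categories) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    proportions = [
        sum(c[k] for c in counts) / (n_items * n_raters) for k in categories
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy data: four prompts, five judges each (not from the paper).
toy = [
    ["CODE"] * 5,
    ["CODE"] * 4 + ["KNOWLEDGE"],
    ["KNOWLEDGE"] * 5,
    ["KNOWLEDGE"] * 4 + ["CODE"],
]
kappa = fleiss_kappa(toy, ["CODE", "KNOWLEDGE"])
print(round(kappa, 3))
```

On the toy data the observed agreement is 0.8 and the chance agreement is 0.5, so kappa is approximately 0.6. Under the commonly used Landis and Koch bands, values above 0.80 are described as "almost perfect", which is the band the reported 0.876 falls in.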
Abstract
Existing benchmarks of language-model refusal on malicious-coding tasks routinely conflate requests for executable malicious software with requests for harmful security knowledge. This conflation matters because the two request types plausibly trigger distinct refusal pathways in safety-aligned language models, and a single refusal-rate statistic computed over a mixture cannot isolate either. This paper introduces a weapons-versus-knowledge classification axis, operationalized through a five-model consensus protocol, and applies it to 3,133 prompts drawn from four public benchmarks, yielding a 1,554-prompt consensus-CODE bank (the primary released artifact) and a 388-prompt consensus-KNOWLEDGE comparison set used by the companion benchmark paper. The consensus pipeline uses five large-language-model judges spanning five vendor families (Anthropic, OpenAI, Google, Zhipu AI, Alibaba),…
