Refusal Evaluation in Coding LLMs and Code Agents: A Systematic Review of Thirteen Malicious-Code Prompt Corpora (2023-2025)
Richard J. Young, Gregory D. Moody

TL;DR
This systematic review analyzes thirteen publicly released prompt corpora for evaluating large language models' refusal to engage in malicious coding tasks, highlighting methodological gaps and proposing directions for future research.
Contribution
Unlike existing surveys, which center on code security, jailbreak taxonomy, or vulnerability detection, this review treats the prompt datasets themselves as the unit of analysis and synthesizes how each corpus was constructed, categorized, and validated.
Findings
The corpora lack human-annotator baselines for calibration.
Refusal rates are not comparable across corpora.
Malware-category taxonomies are fragmented.
Abstract
The evaluation of large language model refusal on malicious-coding tasks now spans at least thirteen publicly released prompt corpora (AdvBench, the CyberSecEval family, RMCBench, RedCode, MCGMark, JailbreakBench, CySecBench, MalwareBench, CIRCLE, MOCHA, ASTRA, Scam2Prompt / Innoc2Scam-bench, and JAWS-Bench), each constructed under a different protocol, released under different licensing terms, and validated (or not) against different inter-rater reliability standards. Existing surveys treat code security, jailbreak taxonomy, or vulnerability detection as the central object and mention these corpora only in passing. This paper reverses that framing: it treats the prompt datasets themselves as the unit of analysis. Following a PRISMA-style protocol, we specify a search strategy, screen the recent literature on coding-LLM refusal evaluation, apply a uniform extraction template to each…
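The abstract notes that the corpora are validated against different inter-rater reliability standards, but the visible text does not say which statistic each one reports. As a minimal sketch, assuming two annotators label the same model responses as "refuse" or "comply", Cohen's kappa is one common way to quantify their agreement beyond chance. The function name and the labels are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    Illustrative helper (not from the paper): kappa = (p_o - p_e) / (1 - p_e),
    where p_o is observed agreement and p_e is agreement expected by chance.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")

    n = len(labels_a)

    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over labels of the product of each rater's marginal rate.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    all_labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[k] * counts_b[k] for k in all_labels) / (n * n)

    # If both raters used a single identical label, agreement is trivially perfect.
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations of four model responses.
rater_a = ["refuse", "refuse", "comply", "comply"]
rater_b = ["refuse", "comply", "comply", "comply"]
kappa = cohen_kappa(rater_a, rater_b)
```

Here the raters agree on three of four items (observed agreement 0.75) while chance agreement is 0.5, so kappa is 0.5. Reporting a chance-corrected figure like this alongside raw refusal rates is one way a corpus could document how reliable its refusal labels are.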
