TL;DR
This paper investigates the robustness of code language models against multi-turn malicious prompts, introduces a new benchmark for evaluation, and demonstrates that fine-tuning improves their ability to reject adversarial inputs.
Contribution
The paper introduces MOCHA, a benchmark for evaluating code LLM robustness against multi-turn malicious prompts, and shows fine-tuning enhances model resilience without sacrificing coding performance.
Findings
Both open- and closed-source models remain vulnerable to malicious prompts, especially in multi-turn scenarios.
Fine-tuning on MOCHA increases rejection rates by up to 32.4%.
Fine-tuning maintains coding ability while improving robustness.
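To make the rejection-rate metric in these findings concrete, here is a minimal sketch of how such a rate could be computed over multi-turn conversations. It is an illustration under stated assumptions, not the paper's evaluation code: the keyword-based refusal check, the `Conversation` type, and the rule that one refused turn marks the whole conversation as rejected are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical refusal markers; a real evaluation would likely use a stronger
# judge (for example a classifier or human annotation) than keyword matching.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")


@dataclass
class Conversation:
    """Model replies for one prompt sequence, one string per turn."""
    responses: List[str]


def is_refusal(response: str) -> bool:
    """Return True if the reply contains any of the assumed refusal markers."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def conversation_rejected(conv: Conversation) -> bool:
    # Assumption: a multi-turn conversation counts as rejected if the model
    # refuses at any turn, since that interrupts the decomposed task.
    return any(is_refusal(reply) for reply in conv.responses)


def rejection_rate(convs: List[Conversation]) -> float:
    """Fraction of conversations in which the model refused at least once."""
    if not convs:
        return 0.0
    return sum(conversation_rejected(c) for c in convs) / len(convs)


if __name__ == "__main__":
    sample = [
        Conversation(["Sure, here is the code."]),
        Conversation(["Here is step one.", "I cannot help with that."]),
    ]
    print(rejection_rate(sample))
```

Comparing this rate before and after fine-tuning gives the kind of improvement reported above. The summary does not say whether the 32.4% figure is an absolute or a relative increase, so the sketch makes no assumption about that.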
Abstract
Recent advancements in Large Language Models (LLMs) have significantly enhanced their code generation capabilities. However, their robustness against adversarial misuse, particularly through multi-turn malicious coding prompts, remains underexplored. In this work, we introduce code decomposition attacks, where a malicious coding task is broken down into a series of seemingly benign subtasks across multiple conversational turns to evade safety filters. To facilitate systematic evaluation, we introduce MOCHA, a large-scale benchmark designed to evaluate the robustness of code LLMs against both single-turn and multi-turn malicious prompts. Empirical results across open- and closed-source models reveal persistent vulnerabilities, especially under multi-turn scenarios. Fine-tuning on MOCHA improves rejection rates while preserving coding ability, and importantly, enhances…