CWEval: Outcome-driven Evaluation on Functionality and Security of LLM Code Generation
Jinjun Peng, Leyi Cui, Kele Huang, Junfeng Yang, Baishakhi Ray

TL;DR
CWEval introduces an outcome-driven framework and benchmark for evaluating both the functionality and security of code generated by large language models, addressing previous benchmarks' limitations and revealing security issues in LLM outputs.
Contribution
The paper presents CWEval, a novel evaluation framework and multilingual benchmark that accurately assesses both functionality and security in LLM-generated code.
Findings
A significant portion of functionally correct code is insecure.
Previous benchmarks are inaccurate in evaluating security.
CWEval reveals security flaws overlooked by prior methods.
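The first finding, that code can pass functional checks and still be insecure, can be illustrated with a minimal sketch. This is not CWEval's actual harness or API. The task (resolving a file name inside a base directory, a path traversal scenario) and every name below (`insecure_resolve`, `secure_resolve`, `functional_test`, `security_test`, `evaluate`) are hypothetical and chosen only for illustration. The sketch assumes that "outcome-driven" means judging a candidate by what it does on test inputs rather than by matching patterns in its source.

```python
import posixpath
from typing import Callable, Dict, Optional

# Hypothetical task: return the normalized path of `name` inside `base`,
# or None if the request would escape `base`.
Candidate = Callable[[str, str], Optional[str]]


def insecure_resolve(base: str, name: str) -> Optional[str]:
    # Joins and normalizes, but never checks that the result stays under base.
    return posixpath.normpath(posixpath.join(base, name))


def secure_resolve(base: str, name: str) -> Optional[str]:
    base_norm = posixpath.normpath(base)
    path = posixpath.normpath(posixpath.join(base_norm, name))
    # Reject any path whose common prefix with base is not base itself.
    if posixpath.commonpath([base_norm, path]) != base_norm:
        return None
    return path


def functional_test(candidate: Candidate) -> bool:
    # Benign input: the expected file path must be returned.
    return candidate("/srv/data", "report.txt") == "/srv/data/report.txt"


def security_test(candidate: Candidate) -> bool:
    # Malicious inputs: both traversal attempts must be refused.
    return (
        candidate("/srv/data", "../../etc/passwd") is None
        and candidate("/srv/data", "/etc/passwd") is None
    )


def evaluate(candidate: Candidate) -> Dict[str, bool]:
    functional = functional_test(candidate)
    secure = security_test(candidate)
    # A candidate counts as a full success only if both outcomes hold.
    return {"functional": functional, "secure": secure, "func_sec": functional and secure}


if __name__ == "__main__":
    print("insecure:", evaluate(insecure_resolve))
    print("secure:  ", evaluate(secure_resolve))
```

Both candidates pass the functional test, but only `secure_resolve` passes the security test, so a functionality-only benchmark would score them identically.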
Abstract
Large Language Models (LLMs) have significantly aided developers by generating code or assisting in writing it, enhancing productivity across various tasks. While identifying incorrect code is often straightforward, detecting vulnerabilities in functionally correct code is more challenging, especially for developers with limited security knowledge. This poses considerable security risks when using LLM-generated code and underscores the need for robust evaluation benchmarks that assess both functional correctness and security. Current benchmarks like CyberSecEval and SecurityEval attempt to address this need but are hindered by unclear and impractical specifications, failing to assess both functionality and security accurately. To tackle these deficiencies, we introduce CWEval, a novel outcome-driven evaluation framework designed to enhance the evaluation of secure code generation by LLMs. This…
Taxonomy
Topics: Advanced Malware Detection Techniques · Software Engineering Research · Adversarial Robustness in Machine Learning
