CWEval: Outcome-driven Evaluation on Functionality and Security of LLM Code Generation

Jinjun Peng; Leyi Cui; Kele Huang; Junfeng Yang; Baishakhi Ray

arXiv:2501.08200·cs.SE·January 15, 2025·2 cites

TL;DR

CWEval introduces an outcome-driven framework and benchmark for evaluating both the functionality and security of code generated by large language models, addressing previous benchmarks' limitations and revealing security issues in LLM outputs.

Contribution

The paper presents CWEval, a novel evaluation framework and multilingual benchmark that accurately assesses both functionality and security in LLM-generated code.

Findings

01. A significant portion of functionally correct code is insecure.

02. Previous benchmarks are inaccurate in evaluating security.

03. CWEval reveals security flaws overlooked by prior methods.

Abstract

Large Language Models (LLMs) have significantly aided developers by generating code or assisting in writing it, enhancing productivity across various tasks. While identifying incorrect code is often straightforward, detecting vulnerabilities in functionally correct code is more challenging, especially for developers with limited security knowledge. This makes using LLM-generated code a considerable security risk and underscores the need for robust evaluation benchmarks that assess both functional correctness and security. Current benchmarks like CyberSecEval and SecurityEval attempt to meet this need but are hindered by unclear and impractical specifications, and so fail to assess both functionality and security accurately. To tackle these deficiencies, we introduce CWEval, a novel outcome-driven evaluation framework designed to enhance the evaluation of secure code generation by LLMs. This…
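To make "outcome-driven" concrete, here is a minimal sketch of what judging a candidate by its observed behavior could look like: the candidate is executed once on a benign input (functionality) and once on a malicious input (security). This is not taken from the CWEval code base. The task (reading a file from a directory without path traversal), the names `insecure_read`, `secure_read`, and `evaluate`, and the two oracles are all illustrative assumptions.

```python
import os
import tempfile


def insecure_read(base_dir, name):
    # Hypothetical candidate: correct for benign names, but "../" escapes base_dir.
    with open(os.path.join(base_dir, name)) as f:
        return f.read()


def secure_read(base_dir, name):
    # Hypothetical candidate: resolve the path and refuse anything outside base_dir.
    base = os.path.realpath(base_dir)
    target = os.path.realpath(os.path.join(base, name))
    if os.path.commonpath([base, target]) != base:
        return ""
    with open(target) as f:
        return f.read()


def evaluate(candidate):
    """Run the candidate and return (functional, secure) from observed outcomes."""
    with tempfile.TemporaryDirectory() as root:
        base = os.path.join(root, "public")
        os.mkdir(base)
        with open(os.path.join(base, "hello.txt"), "w") as f:
            f.write("hello")
        with open(os.path.join(root, "secret.txt"), "w") as f:
            f.write("SECRET")

        # Functionality oracle (assumed): a benign request returns the file content.
        try:
            functional = candidate(base, "hello.txt") == "hello"
        except Exception:
            functional = False

        # Security oracle (assumed): a traversal request must not leak the secret.
        try:
            leaked = candidate(base, "../secret.txt")
        except Exception:
            leaked = ""
        secure = "SECRET" not in str(leaked)

        return functional, secure


if __name__ == "__main__":
    for fn in (insecure_read, secure_read):
        print(fn.__name__, evaluate(fn))
```

In this sketch both candidates pass the functional check, but only `secure_read` passes the security check, which mirrors the page's finding that functionally correct code can still be insecure.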

Code & Models

Repositories

co1lin/cweval (official)

Taxonomy

Topics: Advanced Malware Detection Techniques · Software Engineering Research · Adversarial Robustness in Machine Learning