Can You Really Trust Code Copilots? Evaluating Large Language Models from a Code Security Perspective

Yutao Mou; Xiao Deng; Yuxiao Luo; Shikun Zhang; Wei Ye

arXiv:2505.10494·cs.CL·May 16, 2025

TL;DR

This paper introduces CoV-Eval, a multi-task benchmark for evaluating how well large language models generate secure code and detect, classify, and repair vulnerabilities. The evaluation reveals current limitations and points to directions for improvement.

Contribution

It presents CoV-Eval, a multi-task benchmark, and VC-Judge, an improved vulnerability review model, advancing the assessment of LLMs in code security.

Findings

01 Most LLMs detect vulnerabilities well

02 LLMs often generate insecure code

03 Challenges remain in recognizing and repairing specific vulnerabilities

Abstract

Code security and usability are both essential for coding assistant applications driven by large language models (LLMs). Current code security benchmarks focus solely on a single evaluation task and paradigm, such as code completion or generation, and lack a comprehensive assessment across dimensions like secure code generation, vulnerability repair, and vulnerability discrimination. In this paper, we first propose CoV-Eval, a multi-task benchmark covering code completion, vulnerability repair, and vulnerability detection and classification, for comprehensive evaluation of LLM code security. We also develop VC-Judge, an improved judgment model that aligns closely with human experts and can review LLM-generated programs for vulnerabilities more efficiently and reliably. We conduct a comprehensive evaluation of 20 proprietary and open-source LLMs. Overall, while most…

Code & Models

Repositories

murraytom/cov-eval
none · Official

Taxonomy

Topics: Software Engineering Research

Methods: Focus