CFCEval: Evaluating Security Aspects in Code Generated by Large Language Models

Cheng Cheng; Jinqiu Yang

arXiv:2512.06248·cs.SE·December 9, 2025

TL;DR

CFCEval is a framework for evaluating code generated by large language models. It addresses dataset bias, introduces a new metric (ELRM), and assesses multiple quality and security dimensions.

Contribution

The paper introduces CFCEval, a novel evaluation framework that pairs a new benchmark (MLVBench) with a new metric (ELRM) to assess the quality and security of LLM-generated code.

Findings

01. CFCEval better captures code quality and security aspects.

02. ELRM aligns more closely with human judgments than CodeBLEU.

03. The framework addresses dataset bias and evaluation shortcomings.

Abstract

Code-focused Large Language Models (LLMs), such as Codex and StarCoder, have demonstrated remarkable capabilities in enhancing developer productivity through context-aware code generation. However, evaluating the quality and security of LLM-generated code remains a significant challenge. Existing evaluation protocols for Code LLMs lack both methodological rigor and comprehensive scope. A key limitation is dataset bias, which arises from unintentional overlap between training and testing data. Furthermore, while CodeBLEU, a BLEU-based metric, is widely used to assess code similarity, it suffers from critical shortcomings, including imprecise tokenization, structural limitations, and low reference diversity. To address these challenges, we introduce CFCEval, a novel framework for evaluating the quality and security of code generated by LLMs. CFCEval mitigates dataset bias by creating a…

Taxonomy

Topics: Software Engineering Research · Advanced Malware Detection Techniques · Software Testing and Debugging Techniques