Unseen Horizons: Unveiling the Real Capability of LLM Code Generation Beyond the Familiar

Yuanliang Zhang; Yifan Xie; Shanshan Li; Ke Liu; Chong Wang; Zhouyang Jia; Xiangbing Huang; Jie Song; Chaopeng Luo; Zhizheng Zheng; Rulin Xu; Yitong Liu; Si Zheng; Xiangke Liao

arXiv:2412.08109·cs.SE·January 16, 2025

Open Access

TL;DR

This paper introduces OBFUSEVAL, a code obfuscation benchmark that evaluates the code generation capabilities of large language models on obfuscated code they have not seen before, so that results reflect generalization rather than familiarity with exposed code.

Contribution

The paper proposes a novel obfuscation-based evaluation benchmark for LLMs, addressing dataset exposure and timeliness issues to better assess true code generation capabilities.
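To make the idea of obfuscation-based evaluation concrete, here is a minimal sketch of one simple obfuscation: renaming identifiers so that a task no longer matches code a model may have memorized. This listing does not say which obfuscation strategies the benchmark applies, so identifier renaming is an assumption chosen for illustration, and the names `RenameIdentifiers` and `obfuscate` are hypothetical. The sketch only handles self-contained functions (no imports, attribute access on renamed modules, or keyword-argument calls) and needs Python 3.9+ for `ast.unparse`.

```python
import ast
import builtins


class RenameIdentifiers(ast.NodeTransformer):
    """Replace function, argument, and variable names with opaque aliases."""

    def __init__(self):
        self.mapping = {}  # original name -> alias

    def _alias(self, name):
        # Leave builtins (len, range, ...) untouched so the code still runs.
        if hasattr(builtins, name):
            return name
        if name not in self.mapping:
            self.mapping[name] = f"v{len(self.mapping)}"
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._alias(node.name)
        self.generic_visit(node)  # rename arguments and body as well
        return node

    def visit_arg(self, node):
        node.arg = self._alias(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._alias(node.id)
        return node


def obfuscate(source: str) -> str:
    """Return `source` with identifiers consistently renamed (toy sketch)."""
    tree = RenameIdentifiers().visit(ast.parse(source))
    return ast.unparse(tree)


if __name__ == "__main__":
    example = (
        "def total(items):\n"
        "    acc = 0\n"
        "    for x in items:\n"
        "        acc = acc + x\n"
        "    return acc\n"
    )
    print(obfuscate(example))
```

The renamed function behaves the same as the original, but its surface form no longer carries descriptive names. In an evaluation of this kind, one would compare a model's test pass rate on the original and on the transformed version of each task.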

Findings

01

Obfuscation reduces test pass rate by up to 62.5%.

02

Current datasets may overestimate LLM capabilities due to exposure.

03

Obfuscation strategies reveal LLMs' limitations in unseen code.

Abstract

Recently, large language models (LLMs) have shown strong potential in code generation tasks. However, there are still gaps before they can be fully applied in actual software development processes. Accurately assessing the code generation capabilities of LLMs has become an important basis for evaluating and improving the models. Some existing works have constructed datasets to evaluate these capabilities. However, the current evaluation process may suffer from the illusion of "Specialist in Familiarity", primarily due to three gaps: the exposure of target code, case timeliness, and dependency availability. The fundamental reason for these gaps is that the code in current datasets may have been extensively exposed and exercised during the training phase, and, because LLMs are continually trained and developed, the datasets' timeliness has been severely compromised.…

Taxonomy

Topics: Law, AI, and Intellectual Property · Digital Rights Management and Security · Artificial Intelligence in Law