Narrowing the Complexity Gap in the Evaluation of Large Language Models
Yang Chen, Shuyang Liu, Reyhaneh Jabbarvand

TL;DR
This paper introduces GeneBench, an automated technique for adding real-world complexity to programming benchmarks. LLM performance drops significantly under these more realistic conditions, so GeneBench gives a more accurate picture of their capabilities than simpler benchmarks do.
Contribution
GeneBench offers a novel, automated multi-objective optimization approach to enhance benchmark complexity, addressing data contamination and overfitting issues in LLM evaluation.
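To make "multi-objective optimization" concrete, here is a minimal Python sketch of Pareto-front selection over two competing objectives. It is an illustration only, not GeneBench's implementation: this summary does not describe the paper's actual objectives, metrics, or search procedure. The `Candidate` type, the `complexity` and `readability` scores, and the variant names are all hypothetical.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """A hypothetical benchmark-program variant with two illustrative scores."""
    name: str
    complexity: float   # assumed: higher means more complex (desired)
    readability: float  # assumed: higher means more readable (desired)


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is at least as good as `b` on both objectives
    and strictly better on at least one."""
    no_worse = a.complexity >= b.complexity and a.readability >= b.readability
    strictly_better = a.complexity > b.complexity or a.readability > b.readability
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Made-up scores, purely for illustration.
candidates = [
    Candidate("original", complexity=1.0, readability=0.9),
    Candidate("variant_a", complexity=3.0, readability=0.8),
    Candidate("variant_b", complexity=2.0, readability=0.7),  # dominated by variant_a
    Candidate("variant_c", complexity=4.0, readability=0.5),
]

front = pareto_front(candidates)
print([c.name for c in front])
```

No single variant wins on both objectives at once, so the result is a set of trade-offs rather than one best answer. A multi-objective approach searches for that set instead of collapsing the objectives into a single score.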
Findings
LLMs' performance drops by 14.9%-60.5% on complex benchmarks
Performance decline persists even with few-shot prompting or fine-tuning
GeneBench's results are consistent with real-world bug repair performance
Abstract
Evaluating Large Language Models (LLMs) with respect to real-world code complexity is essential. Otherwise, there is a risk of overestimating LLMs' programming abilities based on simplistic benchmarks, only to be disappointed when using them in real-world settings. Recently, researchers explored the construction of more realistic benchmarks by mining or augmenting open-source repositories. Such solutions are usually task-specific. Data quality control from real-world projects can also be time-consuming and error-prone. More importantly, evaluating LLMs on fixed benchmark problems is subject to data contamination and overfitting. We propose GeneBench, an automated technique to add real-world complexities to any programming benchmark. GeneBench leverages a multi-objective optimization to increase the complexity of programming problems while maintaining the readability of code similar to…
Taxonomy
Topics: Software Engineering Research · Natural Language Processing Techniques · Topic Modeling
