Narrowing the Complexity Gap in the Evaluation of Large Language Models

Yang Chen; Shuyang Liu; Reyhaneh Jabbarvand

arXiv:2602.18928 · cs.SE · February 24, 2026

Open Access

TL;DR

This paper introduces GeneBench, an automated technique that adds real-world complexity to programming benchmarks. On the resulting, more complex benchmarks, LLMs' performance drops by 14.9%-60.5%, which suggests that simplistic benchmarks overestimate their programming abilities.

Contribution

GeneBench offers a novel, automated multi-objective optimization approach to enhance benchmark complexity, addressing data contamination and overfitting issues in LLM evaluation.
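
To make the idea of multi-objective optimization concrete, the sketch below shows Pareto filtering over two competing scores. It is only an illustration under stated assumptions: the `Candidate` class, the `complexity` and `readability` scores, and the sample values are all hypothetical, and the code is not GeneBench's actual algorithm or objective set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothetical transformed program with two illustrative scores."""
    name: str
    complexity: float   # assumed objective: higher is better
    readability: float  # assumed objective: higher is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.complexity >= b.complexity and a.readability >= b.readability
    strictly_better = a.complexity > b.complexity or a.readability > b.readability
    return no_worse and strictly_better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Made-up scores, chosen only to exercise the filter.
candidates = [
    Candidate("original", complexity=1.0, readability=0.9),
    Candidate("variant_a", complexity=3.0, readability=0.8),
    Candidate("variant_b", complexity=5.0, readability=0.3),
    Candidate("variant_c", complexity=2.0, readability=0.5),
]
front = pareto_front(candidates)
print([c.name for c in front])
```

In this toy data, `variant_c` is dropped because `variant_a` is both more complex and more readable; the remaining candidates each represent a different trade-off between the two objectives.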

Findings

01. LLMs' performance drops by 14.9%-60.5% on complex benchmarks.
02. The performance decline persists even with few-shot prompting or fine-tuning.
03. GeneBench's results are consistent with real-world bug repair performance.

Abstract

Evaluating Large Language Models (LLMs) with respect to real-world code complexity is essential. Otherwise, there is a risk of overestimating LLMs' programming abilities based on simplistic benchmarks, only to be disappointed when using them in real-world settings. Recently, researchers explored the construction of more realistic benchmarks by mining or augmenting open-source repositories. Such solutions are usually task-specific. Data quality control from real-world projects can also be time-consuming and error-prone. More importantly, evaluating LLMs on fixed benchmark problems is subject to data contamination and overfitting. We propose GeneBench, an automated technique to add real-world complexities to any programming benchmark. GeneBench leverages a multi-objective optimization to increase the complexity of programming problems while maintaining the readability of code similar to…

Taxonomy

Topics: Software Engineering Research · Natural Language Processing Techniques · Topic Modeling