Where Do LLMs Still Struggle? An In-Depth Analysis of Code Generation Benchmarks

Amir Molzam Sharifloo; Maedeh Heydari; Parsa Kazerooni; Daniel Maninger; Mira Mezini

arXiv:2511.04355·cs.SE·November 7, 2025

Open Access

TL;DR

This paper examines the code generation tasks that major LLMs consistently fail across four popular benchmarks, identifying recurring weaknesses and task complications that hinder their performance.

Contribution

The paper provides an in-depth analysis of LLM code generation failures, revealing common failure patterns and task complexities that contribute to them.

Findings

01. Identified four recurring failure patterns in LLMs

02. Analyzed 114 tasks to understand common failure causes

03. Highlighted the impact of task complexity on LLM performance

Abstract

Large Language Models (LLMs) have achieved remarkable success in code generation, and the race to improve their performance has become a central focus of AI research. Benchmarks and leaderboards are increasingly popular, offering quantitative rankings of LLMs. However, they provide limited insight into the tasks that LLMs consistently fail to solve - information that is crucial for understanding current limitations and guiding the development of more capable models. To address this gap, we examined code generation tasks across four popular benchmarks, identifying those that major LLMs are most likely to fail. To understand the causes of these failures, we investigated whether the static complexity of solution code contributes to them, followed by a systematic inspection of 114 tasks that LLMs consistently struggled with. Our analysis revealed four recurring patterns of weaknesses in…

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Machine Learning in Materials Science