Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation

Musfiqur Rahman; SayedHassan Khatoonabadi; Emad Shihab

arXiv:2510.26130 · cs.SE · November 6, 2025

TL;DR

This paper evaluates large language models on real-world class-level code generation tasks. Correctness falls from 84-89% on synthetic benchmarks to 25-34% on real-world classes, which shows how far current models are from reliable practical code assistance.

Contribution

Introduces a real-world class-level code generation benchmark from open-source repositories and systematically analyzes LLM performance and failure modes in practical scenarios.

Findings

01. LLMs achieve 84-89% correctness on synthetic benchmarks.

02. LLMs attain only 25-34% correctness on real-world class tasks.

03. Retrieval augmentation improves correctness by 4-7%.

Abstract

Large language models (LLMs) have demonstrated strong performance on function-level code generation benchmarks, yet real-world software development increasingly demands class-level implementations that integrate multiple methods, attributes, and dependencies within authentic project contexts. This gap between benchmark performance and practical utility raises critical questions about LLMs' readiness for production code assistance, particularly regarding their ability to generalize across familiar and novel codebases. We introduce a benchmark derived from real-world open-source repositories, comprising classes divided into seen and unseen partitions to evaluate generalization under practical conditions. We systematically examine how input specification completeness and retrieval-augmented generation affect class-level correctness across multiple state-of-the-art LLMs. Our evaluation…
