Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation
Musfiqur Rahman, SayedHassan Khatoonabadi, Emad Shihab

TL;DR
This paper evaluates large language models on real-world class-level code generation tasks, revealing significant performance gaps compared to synthetic benchmarks and highlighting areas for improvement in practical code assistance.
Contribution
Introduces a real-world class-level code generation benchmark built from open-source repositories and systematically analyzes LLM performance and failure modes in practical scenarios.
Findings
LLMs achieve 84-89% correctness on synthetic benchmarks
LLMs attain only 25-34% correctness on real-world class tasks
Retrieval augmentation improves correctness by 4-7%
Abstract
Large language models (LLMs) have demonstrated strong performance on function-level code generation benchmarks, yet real-world software development increasingly demands class-level implementations that integrate multiple methods, attributes, and dependencies within authentic project contexts. This gap between benchmark performance and practical utility raises critical questions about LLMs' readiness for production code assistance, particularly regarding their ability to generalize across familiar and novel codebases. We introduce a benchmark derived from real-world open-source repositories, comprising classes divided into seen and unseen partitions to evaluate generalization under practical conditions. We systematically examine how input specification completeness and retrieval-augmented generation affect class-level correctness across multiple state-of-the-art LLMs. Our evaluation…
