CodeEval: A pedagogical approach for targeted evaluation of code-trained Large Language Models

Danny Brahman; Mohammad Mahoor

arXiv:2601.03432·cs.SE·January 8, 2026

TL;DR

This paper introduces CodeEval, a comprehensive benchmarking dataset and framework designed to evaluate large language models' Python coding abilities across multiple proficiency levels and problem types, addressing gaps in existing evaluation methods.

Contribution

The paper presents a novel pedagogical benchmarking approach with a multi-dimensional dataset and an open-source evaluation framework for targeted assessment of code-trained LLMs.

Findings

01. CodeEval covers 24 aspects of Python programming.

02. RunCodeEval facilitates detailed, automated evaluation.

03. Benchmarking reveals specific strengths and weaknesses of LLMs.
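
As a rough illustration of what "specific strengths and weaknesses" could look like in practice, the sketch below tallies pass rates per programming aspect from a list of test outcomes. It is not RunCodeEval's API: the function name `pass_rate_by_aspect`, the `(aspect, passed)` record shape, and the aspect labels are all hypothetical, chosen only to show the aggregation idea.

```python
from collections import defaultdict


def pass_rate_by_aspect(results):
    """Return {aspect: fraction of problems passed}.

    `results` is an iterable of (aspect, passed) pairs. This record
    shape is an assumption made for illustration, not the format
    used by the paper's framework.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for aspect, passed in results:
        totals[aspect] += 1
        if passed:
            passes[aspect] += 1
    return {aspect: passes[aspect] / totals[aspect] for aspect in totals}


# Hypothetical outcomes for one model; aspect names are made up.
sample_results = [
    ("recursion", True),
    ("recursion", False),
    ("string formatting", True),
    ("string formatting", True),
]

rates = pass_rate_by_aspect(sample_results)
# The aspect with the lowest pass rate is a candidate weakness.
weakest = min(rates, key=rates.get)
```

Grouping results by aspect rather than reporting one overall score is what lets a benchmark of this kind point at a particular topic a model struggles with.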

Abstract

Large Language Models (LLMs) are predominantly assessed based on their common sense reasoning, language comprehension, and logical reasoning abilities. While models trained in specialized domains like mathematics or coding have demonstrated remarkable advancements in logical reasoning, there remains a significant gap in evaluating their code generation capabilities. Existing benchmark datasets fall short in pinpointing specific strengths and weaknesses, impeding targeted enhancements in models' reasoning abilities to synthesize code. To bridge this gap, our paper introduces an innovative, pedagogical benchmarking method that mirrors the evaluation processes encountered in academic programming courses. We introduce CodeEval, a multi-dimensional benchmark dataset designed to rigorously evaluate LLMs across 24 distinct aspects of Python programming. The dataset covers three proficiency…

Taxonomy

Topics: Machine Learning in Materials Science · Topic Modeling · Scientific Computing and Data Management