TaskEval: Assessing Difficulty of Code Generation Tasks for Large Language Models
Florian Tambon, Amin Nikanjam, Cyrine Zid, Foutse Khomh, Giuliano Antoniol

TL;DR
TaskEval introduces a novel framework using diverse prompts and Item Response Theory to assess the difficulty of code generation tasks for large language models, providing deeper insights into task properties and model performance.
Contribution
The paper presents TaskEval, a new approach that characterizes task difficulty and properties using diverse prompts and IRT, enhancing benchmark evaluation of LLMs in code generation.
Findings
TaskEval effectively characterizes task properties and difficulty levels.
Identifies 17 and 21 topics within the two code benchmarks studied (HumanEval+ and ClassEval).
Reveals patterns linking task difficulty with programming constructs.
Abstract
Large Language Models (LLMs) excel in code-related tasks like code generation, but benchmark evaluations often overlook task characteristics, such as difficulty. Moreover, benchmarks are usually built using tasks described with a single prompt, despite the formulation of prompts having a profound impact on the outcome. This paper introduces a generalist approach, TaskEval, a framework using diverse prompts and Item Response Theory (IRT) to efficiently assess LLMs' capabilities and benchmark task characteristics, improving the understanding of their performance. Using two code generation benchmarks, HumanEval+ and ClassEval, as well as 8 code generation LLMs, we show that TaskEval is capable of characterising the properties of tasks. Using topic analysis, we identify and analyse the tasks of 17 and 21 topics within the benchmarks. We also cross-analyse tasks'…
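To make the IRT idea in the abstract concrete, here is a minimal sketch of how task difficulty can be estimated from pass/fail outcomes. It is not the authors' implementation: it assumes a one-parameter (Rasch) model, a toy response matrix in which each row is a hypothetical model-and-prompt combination, and a weak Gaussian prior (`lam`) added only to keep the estimates finite. The names `p_correct` and `fit_rasch` are illustrative.

```python
import math


def p_correct(theta, b, a=1.0):
    """Logistic item characteristic curve: probability that a respondent
    with ability `theta` solves a task with difficulty `b`.
    `a` is the discrimination (fixed to 1.0 for the Rasch model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fit_rasch(responses, steps=2000, lr=0.05, lam=0.01):
    """Estimate abilities and task difficulties by gradient ascent on the
    penalised log-likelihood of a Rasch model.

    responses[i][j] is 1 if respondent i solved task j, else 0.
    `lam` is an assumed weak Gaussian prior, not something from the paper.
    """
    n, m = len(responses), len(responses[0])
    theta = [0.0] * n  # one ability per respondent
    b = [0.0] * m      # one difficulty per task
    for _ in range(steps):
        g_theta = [-lam * t for t in theta]
        g_b = [-lam * d for d in b]
        for i in range(n):
            for j in range(m):
                resid = responses[i][j] - p_correct(theta[i], b[j])
                g_theta[i] += resid  # success above expectation raises ability
                g_b[j] -= resid      # success above expectation lowers difficulty
        theta = [t + lr * g for t, g in zip(theta, g_theta)]
        b = [d + lr * g for d, g in zip(b, g_b)]
    return theta, b


# Toy data (hypothetical): 5 respondents x 3 tasks.
# Task 0 is solved 4 times, task 1 three times, task 2 once.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 1, 0],
]

theta, b = fit_rasch(responses)
for j, d in enumerate(b):
    print(f"task {j}: estimated difficulty {d:+.2f}")
```

In this sketch, tasks that are solved less often receive a higher estimated difficulty, and respondents that solve more tasks receive a higher estimated ability. A real analysis would more likely use a dedicated IRT library and a richer model.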
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Software System Performance and Reliability
