StackEval: Benchmarking LLMs in Coding Assistance

Nidhish Shah; Zulkuf Genc; Dogu Araci

arXiv:2412.05288 · cs.SE · December 10, 2024 · 3 cites

TL;DR

This paper introduces two benchmarks, the large-scale StackEval and the dynamic StackUnseen, to evaluate LLMs on coding assistance tasks (code writing, debugging, code review, and conceptual understanding) and to highlight the models' capabilities and limitations.

Contribution

The paper presents two curated datasets derived from Stack Overflow, one of which is a dynamic benchmark built from the most recent content, and assesses how well LLMs act as judges of coding answers, including potential biases such as favoring their own generated solutions.

Findings

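01. The benchmarks reveal LLMs' capabilities and limitations in coding assistance, particularly on new and emerging content.

02. A curated, human-annotated dataset enables assessment of LLMs as judges of coding answers, including whether they favor their own generated solutions.

03. Public datasets and code support reproducibility and further research.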

Abstract

We present two comprehensive benchmarks to evaluate the performance of language models in coding assistance tasks, covering code writing, debugging, code review, and conceptual understanding. Our main contribution includes two curated datasets: StackEval, a large-scale benchmark derived from Stack Overflow questions, and StackUnseen, a dynamic benchmark featuring the most recent Stack Overflow content. These benchmarks offer novel insights into the capabilities and limitations of LLMs, particularly in handling new and emerging content. Additionally, we assess LLMs' proficiency as judges for coding tasks using a curated, human-annotated dataset, exploring their evaluation capabilities and potential biases, including whether they favor their own generated solutions. Our findings underscore the potential of these benchmarks to advance LLM development and application in coding assistance.…
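The judge assessment described in the abstract can be illustrated with a small sketch. This is not the authors' evaluation code: the record fields (`author`, `judge_accepts`, `human_accepts`), the function names, the toy data, and the specific bias measure are all assumptions made for illustration. The idea is to compare a judge's accept/reject verdicts with human labels, then check whether the judge is more lenient toward answers it wrote itself.

```python
def acceptance_rate(records, key):
    """Fraction of records whose boolean field `key` is True."""
    if not records:
        raise ValueError("records must be non-empty")
    return sum(1 for r in records if r[key]) / len(records)


def judge_agreement(records):
    """Fraction of answers where the judge's verdict matches the human label."""
    if not records:
        raise ValueError("records must be non-empty")
    matches = sum(1 for r in records if r["judge_accepts"] == r["human_accepts"])
    return matches / len(records)


def self_preference_gap(records, judge_name):
    """Hypothetical bias measure (an assumption, not taken from the paper).

    For each group of answers, leniency = judge acceptance rate minus human
    acceptance rate. The gap is leniency on the judge's own answers minus
    leniency on other models' answers; a positive value suggests the judge
    favors its own output.
    """
    own = [r for r in records if r["author"] == judge_name]
    other = [r for r in records if r["author"] != judge_name]

    def leniency(group):
        return acceptance_rate(group, "judge_accepts") - acceptance_rate(
            group, "human_accepts"
        )

    return leniency(own) - leniency(other)


# Toy, made-up records: each is one answer scored by the judge and by a human.
records = [
    {"author": "model_a", "judge_accepts": True, "human_accepts": True},
    {"author": "model_a", "judge_accepts": True, "human_accepts": False},
    {"author": "model_b", "judge_accepts": False, "human_accepts": False},
    {"author": "model_b", "judge_accepts": True, "human_accepts": True},
]

agreement = judge_agreement(records)
gap = self_preference_gap(records, judge_name="model_a")
print(f"agreement with humans: {agreement:.2f}")
print(f"self-preference gap for model_a as judge: {gap:.2f}")
```

Subtracting the human acceptance rate within each group keeps a judge from looking biased merely because its own answers are genuinely better; only leniency beyond what humans grant counts toward the gap.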

Code & Models

Repositories

ProsusAI/stack-eval (Official)

Videos

StackEval: Benchmarking LLMs in Coding Assistance · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques