StackEval: Benchmarking LLMs in Coding Assistance
Nidhish Shah, Zulkuf Genc, Dogu Araci

TL;DR
This paper introduces two benchmarks, the large-scale StackEval and the dynamic StackUnseen, to evaluate LLMs on coding assistance tasks, including code writing, debugging, code review, and conceptual understanding, and highlights their capabilities and limitations.
Contribution
The paper presents curated datasets derived from Stack Overflow, including a dynamic recent-content benchmark, and assesses LLMs' evaluation abilities and biases in coding tasks.
Findings
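The benchmarks expose LLMs' strengths and weaknesses in coding assistance, particularly on new and emerging content.
A curated, human-annotated dataset enables assessment of LLMs as judges of coding answers, including whether they favor their own generated solutions.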
Public datasets and code support reproducibility and further research.
Abstract
We present two comprehensive benchmarks to evaluate the performance of language models in coding assistance tasks, covering code writing, debugging, code review, and conceptual understanding. Our main contribution includes two curated datasets: StackEval, a large-scale benchmark derived from Stack Overflow questions, and StackUnseen, a dynamic benchmark featuring the most recent Stack Overflow content. These benchmarks offer novel insights into the capabilities and limitations of LLMs, particularly in handling new and emerging content. Additionally, we assess LLMs' proficiency as judges for coding tasks using a curated, human-annotated dataset, exploring their evaluation capabilities and potential biases, including whether they favor their own generated solutions. Our findings underscore the potential of these benchmarks to advance LLM development and application in coding assistance.…
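To make the judge-evaluation idea in the abstract concrete, here is a minimal sketch of two quantities one might compute: how often an LLM judge's verdict matches a human annotation, and how much more often a judge accepts its own answers than answers from other models. This is an illustration under stated assumptions, not the paper's method: the function names, the boolean accept/reject verdicts, and the record layout are all hypothetical.

```python
def judge_agreement(judge_labels, human_labels):
    """Fraction of items where the judge's verdict equals the human label.

    Hypothetical helper: both inputs are equal-length lists of booleans
    (True = answer accepted).
    """
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def self_preference_gap(records, judge):
    """Acceptance rate on the judge's own answers minus the rate on others'.

    Hypothetical record layout: (judge_name, author_name, accepted).
    A positive gap means the judge accepts its own answers more often.
    """
    own = [acc for j, author, acc in records if j == judge and author == judge]
    other = [acc for j, author, acc in records if j == judge and author != judge]
    if not own or not other:
        raise ValueError("need verdicts on both own and other answers")
    return sum(own) / len(own) - sum(other) / len(other)


if __name__ == "__main__":
    # Toy, made-up verdicts purely to exercise the functions.
    agreement = judge_agreement([True, False, True, True],
                                [True, True, True, False])
    records = [
        ("A", "A", True),
        ("A", "A", True),
        ("A", "B", True),
        ("A", "B", False),
        ("B", "A", False),  # ignored when judge == "A"
    ]
    print(agreement, self_preference_gap(records, "A"))
```

A raw acceptance gap cannot by itself separate bias from a genuine quality difference between models, which is why comparing the judge's verdicts against human labels on the same answers matters.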
Taxonomy
Topics: Natural Language Processing Techniques
