The RealHumanEval: Evaluating Large Language Models' Abilities to Support Programmers

Hussein Mozannar; Valerie Chen; Mohammed Alsobay; Subhro Das; Sebastian Zhao; Dennis Wei; Manish Nagireddy; Prasanna Sattigeri; Ameet Talwalkar; David Sontag

arXiv:2404.02806·cs.SE·October 16, 2024·1 cites

Open Access · 1 Repo · 1 Dataset

TL;DR

This paper introduces RealHumanEval, a human-centric evaluation platform for large language models assisting programmers, revealing that benchmark improvements do not fully translate to real-world productivity gains and highlighting the need for better evaluation proxies.

Contribution

The paper presents RealHumanEval, a new web-based tool for assessing LLMs in real programming tasks, and provides insights into the gap between benchmark performance and actual programmer productivity.

Findings

01 Benchmark improvements correlate with increased productivity but not proportionally.

02 Programmer preferences do not align with actual performance.

03 Gaps between benchmark and human performance persist across support types.

Abstract

Evaluation of large language models for code has primarily relied on static benchmarks, including HumanEval (Chen et al., 2021), or more recently using human preferences of LLM responses. As LLMs are increasingly used as programmer assistants, we study whether gains on existing benchmarks or more preferred LLM responses translate to programmer productivity when coding with LLMs, including time spent coding. We introduce RealHumanEval, a web interface to measure the ability of LLMs to assist programmers, through either autocomplete or chat support. We conducted a user study (N=243) using RealHumanEval in which users interacted with seven LLMs of varying base model performance. Despite static benchmarks not incorporating humans-in-the-loop, we find that improvements in benchmark performance lead to increased programmer productivity; however gaps in benchmark versus human performance are…

Code & Models

Repositories

clinicalml/realhumaneval
none · Official

Datasets

hsseinmz/realhumaneval
dataset · 35 dl

Taxonomy

Topics: Topic Modeling · Online Learning and Analytics · Scientific Computing and Data Management

Methods: Balanced Selection