The RealHumanEval: Evaluating Large Language Models' Abilities to Support Programmers
Hussein Mozannar, Valerie Chen, Mohammed Alsobay, Subhro Das, Sebastian Zhao, Dennis Wei, Manish Nagireddy, Prasanna Sattigeri, Ameet Talwalkar, David Sontag

TL;DR
This paper introduces RealHumanEval, a human-centric evaluation platform for large language models assisting programmers, revealing that benchmark improvements do not fully translate to real-world productivity gains and highlighting the need for better evaluation proxies.
Contribution
The paper presents RealHumanEval, a new web-based tool for assessing LLMs in real programming tasks, and provides insights into the gap between benchmark performance and actual programmer productivity.
Findings
Benchmark improvements correlate with increased productivity but not proportionally.
Programmers' stated preferences do not align with their measured performance.
Gaps between benchmark and human performance persist across both support types (autocomplete and chat).
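To illustrate what "correlated but not proportional" can mean, here is a minimal sketch that compares a rank correlation (does higher benchmark score go with higher productivity?) against a linear correlation (do the gains scale together?). The model scores and the `tasks_per_hour` metric are hypothetical placeholders, not numbers from the paper, and the function names are chosen only for this example.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson (linear) correlation coefficient of two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical placeholder values, NOT results from the paper.
benchmark_score = [0.30, 0.45, 0.60, 0.90]  # e.g. pass rate on a static benchmark
tasks_per_hour = [2.0, 2.1, 2.2, 2.3]       # e.g. a productivity metric from a user study

print("Spearman:", spearman(benchmark_score, tasks_per_hour))
print("Pearson: ", pearson(benchmark_score, tasks_per_hour))
print("Relative benchmark gain:   ", benchmark_score[-1] / benchmark_score[0])
print("Relative productivity gain:", tasks_per_hour[-1] / tasks_per_hour[0])
```

With these made-up inputs the ordering of models is identical under both measures, so the rank correlation is essentially 1, while the relative productivity gain is much smaller than the relative benchmark gain. That is the shape of the finding described above, not a reproduction of it.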
Abstract
Evaluation of large language models for code has primarily relied on static benchmarks, including HumanEval (Chen et al., 2021), or more recently using human preferences of LLM responses. As LLMs are increasingly used as programmer assistants, we study whether gains on existing benchmarks or more preferred LLM responses translate to programmer productivity when coding with LLMs, including time spent coding. We introduce RealHumanEval, a web interface to measure the ability of LLMs to assist programmers, through either autocomplete or chat support. We conducted a user study (N=243) using RealHumanEval in which users interacted with seven LLMs of varying base model performance. Despite static benchmarks not incorporating humans-in-the-loop, we find that improvements in benchmark performance lead to increased programmer productivity; however gaps in benchmark versus human performance are…
Taxonomy
Topics: Topic Modeling · Online Learning and Analytics · Scientific Computing and Data Management
Methods: Balanced Selection
