User Centric Evaluation of Code Generation Tools
Tanha Miah, Hong Zhu

TL;DR
This paper introduces a user-centric evaluation method for code generation tools like LLMs, focusing on usability and user experience, demonstrated through a case study on ChatGPT for R programming.
Contribution
It proposes a novel usability-focused evaluation framework incorporating metadata, multi-attempt testing, and user experience metrics, filling a gap in LLM assessment beyond capability comparison.
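As a rough illustration of how such a framework might record results, the sketch below stores per-task metadata, multi-attempt outcomes, and usability ratings, then aggregates them. It is not the authors' implementation. The names `TaskRecord`, `attempts_used`, `average_attempts`, and `mean_attribute_scores` are hypothetical, as are the metadata fields, the "accuracy" attribute, and the toy data. The only details taken from the summary are the idea of counting attempts per task and rating attributes such as conciseness on a 5-point scale.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class TaskRecord:
    """One benchmark task with assumed metadata and per-attempt outcomes."""
    task_id: str
    metadata: dict   # assumed fields, e.g. {"topic": "data frames"}
    attempts: list   # one bool per attempt: True if the generated code was accepted
    scores: dict = field(default_factory=dict)  # attribute name -> rating on a 1..5 scale


def attempts_used(record):
    """Attempts up to and including the first accepted one (all of them if none succeeded)."""
    for i, accepted in enumerate(record.attempts, start=1):
        if accepted:
            return i
    return len(record.attempts)


def average_attempts(records):
    """Mean number of attempts a user needed per task."""
    return mean(attempts_used(r) for r in records)


def mean_attribute_scores(records):
    """Mean rating per quality attribute across all tasks that were rated on it."""
    by_attribute = {}
    for r in records:
        for attribute, rating in r.scores.items():
            by_attribute.setdefault(attribute, []).append(rating)
    return {attribute: mean(ratings) for attribute, ratings in by_attribute.items()}


# Toy data for illustration only; these are not results from the paper.
records = [
    TaskRecord("t1", {"topic": "vectors"}, [True], {"conciseness": 4, "accuracy": 5}),
    TaskRecord("t2", {"topic": "data frames"}, [False, True], {"conciseness": 3, "accuracy": 4}),
]

print(average_attempts(records))
print(mean_attribute_scores(records))
```

Keeping the raw per-attempt outcomes, rather than only a pass/fail flag per task, is what lets an attempts-per-task average be computed alongside the attribute ratings.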
Findings
ChatGPT is highly useful for R code generation.
Average user attempts per task are 1.61.
Usability is weakest in conciseness, scoring 3.80/5.
Abstract
With the rapid advance of machine learning (ML) technology, large language models (LLMs) are increasingly explored as an intelligent tool to generate program code from natural language specifications. However, existing evaluations of LLMs have focused on their capabilities in comparison with humans. It is desirable to evaluate their usability when deciding on whether to use an LLM in software production. This paper proposes a user centric method for this purpose. It includes metadata in the test cases of a benchmark to describe their usages, conducts testing in a multi-attempt process that mimics the uses of LLMs, measures LLM generated solutions on a set of quality attributes that reflect usability, and evaluates the performance based on user experiences in the uses of LLMs as a tool. The paper also reports a case study with the method in the evaluation of ChatGPT's usability as a…
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Explainable Artificial Intelligence (XAI)
Methods: Sparse Evolutionary Training
