RES-Q: Evaluating Code-Editing Large Language Model Systems at the Repository Scale

Beck LaBash; August Rosedale; Alex Reents; Lucas Negritto; Colin Wiel

arXiv:2406.16801 · cs.CL · June 27, 2024

TL;DR

RES-Q is a new benchmark for evaluating large language models' ability to perform repository editing tasks, providing a more comprehensive assessment than traditional benchmarks by using real GitHub commit-based tasks.

Contribution

The paper introduces RES-Q, a novel natural language instruction-based benchmark with real repository editing tasks, and demonstrates its effectiveness in differentiating LLM capabilities.

Findings

01. Claude Sonnet 3.5 outperforms GPT-4o on RES-Q by 12% pass@1.

02. RES-Q can distinguish model capabilities beyond traditional benchmarks.

03. Analysis of token efficiency and model disparities reveals insights into LLM performance.
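
The pass@1 figure in the first finding can be read as the share of tasks a system solves on its single attempt. The snippet below is a minimal sketch of that arithmetic, not the benchmark's own harness: the function name `pass_at_1`, the task ids, and the per-task outcomes are all hypothetical and chosen only for illustration.

```python
from typing import Mapping


def pass_at_1(results: Mapping[str, bool]) -> float:
    """Return the fraction of tasks whose single generated edit passed.

    `results` maps a task id to True if the one attempt for that task
    satisfied the task's checks, and False otherwise.
    """
    if not results:
        raise ValueError("no task results supplied")
    return sum(results.values()) / len(results)


# Hypothetical outcomes for two systems on the same five tasks
# (illustrative values only, not data from the paper).
system_a = {"t1": True, "t2": True, "t3": False, "t4": True, "t5": False}
system_b = {"t1": True, "t2": False, "t3": False, "t4": True, "t5": False}

# Difference between the two pass@1 scores, as a fraction of all tasks.
gap = pass_at_1(system_a) - pass_at_1(system_b)
print(f"A: {pass_at_1(system_a):.0%}  B: {pass_at_1(system_b):.0%}  gap: {gap:.0%}")
```

With a single attempt per task, pass@1 reduces to a plain success rate, so a gap between two systems is the difference between two such rates.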

Abstract

The instruction-following ability of Large Language Models (LLMs) has cultivated a class of LLM-based systems capable of approaching complex tasks such as making edits to large code repositories. Due to the high sensitivity and unpredictability of LLM behavior in response to changes in prompting, robust evaluation tools are needed to drive future iteration of these systems. We propose RES-Q, a natural language instruction-based benchmark for evaluating Repository Editing Systems, which consists of 100 handcrafted repository editing tasks derived from real GitHub commits. Given an edit instruction and a code repository, RES-Q evaluates an LLM system's ability to interpret the instruction, navigate the repository to gather relevant information, and construct an appropriate edit that satisfies the specified criteria. We argue that evaluating LLMs in this…

Code & Models

Repositories

qurrent-ai/res-q
Official

Datasets

Qurrent/RES-Q
dataset · 325 downloads

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Model-Driven Software Engineering Techniques