RES-Q: Evaluating Code-Editing Large Language Model Systems at the Repository Scale
Beck LaBash, August Rosedale, Alex Reents, Lucas Negritto, Colin Wiel

TL;DR
RES-Q is a new benchmark for evaluating large language models' ability to perform repository editing tasks, providing a more comprehensive assessment than traditional benchmarks by using real GitHub commit-based tasks.
Contribution
The paper introduces RES-Q, a novel natural language instruction-based benchmark with real repository editing tasks, and demonstrates its effectiveness in differentiating LLM capabilities.
Findings
Claude Sonnet 3.5 outperforms GPT-4o on RES-Q by 12% pass@1.
RES-Q can distinguish model capabilities beyond traditional benchmarks.
Analysis of token efficiency and model disparities reveals insights into LLM performance.
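The pass@1 figure above can be read as the fraction of tasks solved by a single attempt. The listing does not say how RES-Q computes it, so the sketch below uses the widely used unbiased pass@k estimator as an assumption; the function names and the `(n, c)` result format are illustrative, not taken from the paper.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n sampled attempts, c of which passed.

    Uses the common unbiased estimator 1 - C(n - c, k) / C(n, k).
    This is an assumed metric definition, not confirmed by the source.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset of attempts contains at least one passing attempt.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: List[Tuple[int, int]], k: int = 1) -> float:
    """Average per-task pass@k over a benchmark.

    `results` holds one (n, c) pair per task (hypothetical format).
    """
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With one attempt per task (`n = 1`, `k = 1`), this reduces to the plain share of tasks whose single attempt passed.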
Abstract
The instruction-following ability of Large Language Models (LLMs) has cultivated a class of LLM-based systems capable of approaching complex tasks such as making edits to large code repositories. Due to the high sensitivity and unpredictability of LLM behavior in response to changes in prompting, robust evaluation tools are needed to drive future iteration of these systems. We propose RES-Q, a natural language instruction-based benchmark for evaluating Repository Editing Systems, which consists of 100 handcrafted repository editing tasks derived from real GitHub commits. Given an edit instruction and a code repository, RES-Q evaluates an LLM system's ability to interpret the instruction, navigate the repository to gather relevant information, and construct an appropriate edit that satisfies the specified criteria. We argue that evaluating LLMs in this…
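The evaluation flow the abstract describes (instruction plus repository in, edit out, edit checked against criteria) might be sketched roughly as follows. Everything here is a hypothetical illustration: the `Task` type, the dictionary-based repository, and the `check` callable are assumptions made for the example, not the RES-Q harness or its data format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical repository model: file path -> file contents.
Repo = Dict[str, str]


@dataclass
class Task:
    instruction: str               # natural language edit instruction
    repo: Repo                     # repository state before the edit
    check: Callable[[Repo], bool]  # assumed acceptance criterion for the edit


def evaluate(system: Callable[[str, Repo], Repo], tasks: List[Task]) -> float:
    """Return the fraction of tasks whose edited repository passes its check."""
    if not tasks:
        raise ValueError("tasks must not be empty")
    passed = 0
    for task in tasks:
        try:
            # Hand the system a copy so the original task state is untouched.
            edited = system(task.instruction, dict(task.repo))
            ok = bool(task.check(edited))
        except Exception:
            # A crashing system or check counts as a failed task.
            ok = False
        passed += ok
    return passed / len(tasks)
```

Here `system` stands in for any LLM-based editing pipeline; with one attempt per task, the returned value corresponds to a pass@1-style score.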
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Model-Driven Software Engineering Techniques
